POKI_PUT_TOC_HERE
Whereas the Unix toolkit is made of the separate executables cat, tail, cut, sort, etc., Miller has subcommands, invoked as follows:
POKI_INCLUDE_ESCAPED(data/subcommand-example.txt)HERE
Commands | Description |
---|---|
cat, cut, grep, head, join, sort, tac, tail, top, uniq | Analogs of their Unix-toolkit namesakes, discussed below as well as in POKI_PUT_LINK_FOR_PAGE(feature-comparison.html)HERE |
filter, put, sec2gmt, sec2gmtdate, step, tee | awk-like functionality |
bar, bootstrap, decimate, histogram, least-frequent, most-frequent, sample, shuffle, stats1, stats2 | Statistically oriented |
group-by, group-like, having-fields | Particularly oriented toward POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE, although all Miller commands can handle heterogeneous records |
check, count-distinct, label, merge-fields, nest, nothing, rename, reorder, reshape, seqgen | These draw from other sources (see also POKI_PUT_LINK_FOR_PAGE(originality.html)HERE): count-distinct is SQL-ish, and rename can be done by sed (which does it faster: see POKI_PUT_LINK_FOR_PAGE(performance.html)HERE). |
--dkvp --idkvp --odkvp --nidx --inidx --onidx --csv --icsv --ocsv --csvlite --icsvlite --ocsvlite --pprint --ipprint --opprint --right --xtab --ixtab --oxtab --json --ijson --ojson
These are as discussed in POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE, with the exception of --right, which makes pretty-printed output right-aligned:
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE | POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE |
Use --csv, --pprint, etc. when the input and output formats are the same.
Use --icsv --opprint, etc. when you want format conversion as part of what Miller does to your data.
DKVP (key-value-pair) format is the default for input and output. So, --oxtab is the same as --idkvp --oxtab.
Having --oformat2 clobber all the output-related effects of --format1 also removes some flexibility from the command-line interface. See also https://github.com/johnkerl/miller/issues/180 and https://github.com/johnkerl/miller/issues/199.
Use the mlr -I flag to process files in-place. For example, mlr -I --csv cut -x -f unwanted_column_name mydata/*.csv will remove unwanted_column_name from all your *.csv files in your mydata/ subdirectory.
By default, Miller output goes to the screen (or you can redirect it to a file using > or to another process using |). With -I, for each file name on the command line, Miller writes its output to a temporary file in the same directory, which is then renamed over the original. Then, processing continues on the next file. Each file is processed in isolation: if the output format is CSV, CSV headers will be present in each output file; statistics are only over each file's own records; and so on.
Please see here
for examples.
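The write-to-temp-then-rename pattern described above can be sketched in Python. This is only an illustration of the technique, not Miller's implementation; the function name process_in_place and the per-line transform callback are hypothetical.

```python
import os
import tempfile

def process_in_place(path, transform):
    """Rewrite `path` by writing transformed lines to a temp file in the
    same directory, then renaming that temp file over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as out, open(path) as src:
            for line in src:
                out.write(transform(line))
        # The rename replaces the original only after all output is written.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, drop the temp file and leave the original untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Keeping the temp file in the same directory means the final rename stays on one filesystem, so the original is never left half-written.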
--prepipe {command}
The prepipe command is anything which reads from standard input and produces data acceptable to Miller. Nominally this allows you to use whichever decompression utilities you have installed on your system, on a per-file basis. If the command has flags, quote them: e.g. mlr --prepipe 'zcat -cf'. Examples:
# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz

# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz

# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz

# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz

... etc.
Miller has record separators IRS and ORS, field separators IFS and OFS, and pair separators IPS and OPS. For example, in the DKVP line a=1,b=2,c=3, the record separator is newline, field separator is comma, and pair separator is the equals sign. These are the default values.
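Options:
--rs --irs --ors --fs --ifs --ofs --repifs --ps --ips --ops
You can specify different separators for input and output, e.g. --ifs = --ofs : (an equals sign on input, a colon on output). Or, you can specify that the same separator is to be used for input and output via e.g. --fs : (a colon for both).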
The pair separator is only relevant to DKVP format.
Pretty-print and xtab formats ignore the separator arguments altogether.
The --repifs flag means that multiple successive occurrences of the field separator count as one. For example, in CSV data we often signify nulls by empty strings, e.g. 2,9,,,,,6,5,4. On the other hand, if the field separator is a space, it might be more natural to parse 2 4  5 (with two spaces before the 5) the same as 2 4 5: --repifs --ifs ' ' lets this happen. In fact, the --ipprint option above is internally implemented in terms of --repifs.
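To make the separator roles concrete, here is a minimal Python sketch of splitting one line into fields and key-value pairs. It is a model of the behavior described above, not Miller's parser; the names split_fields and parse_dkvp and their keyword arguments are hypothetical.

```python
import re

def split_fields(line, ifs=",", repifs=False):
    """Split a line on the field separator. With repifs, a run of
    successive separators counts as a single separator."""
    if repifs:
        return re.split("(?:" + re.escape(ifs) + ")+", line)
    return line.split(ifs)

def parse_dkvp(line, ifs=",", ips="=", repifs=False):
    """Parse one DKVP line into a dict of key -> value strings."""
    record = {}
    for field in split_fields(line, ifs, repifs):
        key, _, value = field.partition(ips)
        record[key] = value
    return record

print(parse_dkvp("a=1,b=2,c=3"))
print(split_fields("2,9,,,6"))                  # empty strings are kept
print(split_fields("2 4  5", " ", repifs=True))
```

Without repifs the doubled space yields an empty field, which is the right reading for CSV-style nulls but rarely what is wanted for space-aligned text.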
Just write out the desired separator, e.g. --ofs '|'. But you may use the symbolic names newline, space, tab, pipe, or semicolon if you like.
--ofmt {format string} is the global number format for commands which generate numeric output, e.g. stats1, stats2, histogram, and step, as well as mlr put. Examples:
POKI_CARDIFY(--ofmt %.9le --ofmt %.6lf --ofmt %.0lf)HERE
These are just C printf formats applied to double-precision numbers. Please don't use %s or %d. Additionally, if you use leading width (e.g. %18.12lf) then the output will contain embedded whitespace, which may not be what you want if you pipe the output to something else, particularly CSV. I use Miller's pretty-print format (mlr --opprint) to column-align numerical data.
To apply formatting to a single field, overriding the global ofmt, use the fmtnum function within mlr put. For example:
POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE
Input conversion from hexadecimal is done automatically on fields handled by mlr put and mlr filter as long as the field value begins with "0x". To apply output conversion to hexadecimal on a single column, you may use fmtnum, or the keystroke-saving hexfmt function. Example:
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE
Verbs can be chained together using the then keyword:
POKI_CARDIFY(mlr cut --complement -f os_version then sort -f hostname,uptime *.dat)HERE
(You can precede the very first verb with then, if you like, for symmetry.)
Here’s a performance comparison:
POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE
There are two reasons to use then-chaining: one is for performance, although I don't expect this to be a win in all cases. Using then-chaining avoids redundant string-parsing and string-formatting at each pipeline step: instead input records are parsed once, they are fed through each pipeline stage in memory, and then output records are formatted once. On the other hand, Miller is single-threaded, while modern systems are usually multi-processor, and when streaming-data programs operate through pipes, each one can use a CPU. Rest assured you get the same results either way.
The other reason to use then-chaining is for simplicity: you don't have to re-type formatting flags (e.g. --csv --fs tab) at every pipeline stage.
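The idea of feeding already-parsed records through successive stages in memory can be sketched as follows. This is a toy model in Python, not how Miller is implemented; chain, cut_complement, and sort_by are hypothetical names loosely mirroring the cut and sort verbs.

```python
def chain(*stages):
    """Compose stages; each stage maps an iterable of records (dicts)
    to an iterable of records, so nothing is re-parsed in between."""
    def run(records):
        for stage in stages:
            records = stage(records)
        return records
    return run

def cut_complement(field):
    """Stage that drops one field from every record."""
    def stage(records):
        for record in records:
            yield {k: v for k, v in record.items() if k != field}
    return stage

def sort_by(field):
    """Stage that sorts records lexically by one field."""
    def stage(records):
        return sorted(records, key=lambda record: record[field])
    return stage

records = [
    {"hostname": "b", "os_version": "1"},
    {"hostname": "a", "os_version": "2"},
]
pipeline = chain(cut_complement("os_version"), sort_by("hostname"))
print(list(pipeline(records)))  # [{'hostname': 'a'}, {'hostname': 'b'}]
```

Parsing and formatting happen once each, outside the chain, which is the saving described above.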
mlr sort -n, mlr sort -nr.
Statistics: mlr histogram, mlr stats1, mlr stats2.
Cross-record arithmetic: mlr step.
In mlr put and mlr filter: if the input record is x=1,y=2 then mlr put '$z=$x+$y' will produce x=1,y=2,z=3, and mlr put '$z=$x.$y' does not give an error, simply because the dot operator has been generalized to stringify non-strings. To coerce back to string for processing, use the string function: mlr put '$z=string($x).string($y)' will produce x=1,y=2,z=12.
On input, string values representable as boolean (e.g. "true", "false") are not automatically treated as boolean. (This is because "true" and "false" are ordinary words, and auto string-to-boolean on a column consisting of words would result in some strings mixed with some booleans.) Use the boolean function to coerce: e.g. giving the record x=1,y=2,w=false to mlr put '$z=($x<$y) || boolean($w)'.
Functions take types as described in mlr --help-all-functions: for example, log10 takes float input and produces float output, gmt2sec maps string to int, and sec2gmt maps int to string.
All math functions described in mlr --help-all-functions take integer as well as float input.
mlr sort: if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.
Miller has two kinds of null data:
Empty (key present, value empty): e.g. x=,y=2 in the data input stream, or assignment $x="" or @x="" in mlr put.
Absent (key not present): a field name is not present, e.g. input record is x=1,y=2 and a put or filter expression refers to $z. Or, reading an out-of-stream variable which hasn't been assigned a value yet, e.g. mlr put -q '@sum += $x; end{emit @sum}' or mlr put -q '@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'.
These can be tested with the functions is_empty/is_not_empty, is_absent/is_present, and is_null/is_not_null. For the last pair, note that null means either empty or absent.
Rules for null-handling:
The min and max functions are special: if one argument is non-null, it wins:
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'}}HERE
Functions of absent variables (e.g. mlr put '$y = log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean operators with both operands being absent evaluate to absent. Arithmetic operators with one absent operand return the other operand. More specifically, absent values act like zero for addition/subtraction, and one for multiplication. Furthermore, any expression which evaluates to absent is not stored in the left-hand side of an assignment statement:
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'}}HERE
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'}}HERE
Likewise, for assignment to maps, absent-valued keys or values result
in a skipped assignment.
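The absent rules for operators and assignment can be modeled in a few lines of Python. This is a sketch of the rules as stated above, not Miller's code: ABSENT is a hypothetical sentinel, the helper names are invented for illustration, and empty values are deliberately left out.

```python
ABSENT = object()  # hypothetical stand-in for an absent value

def field(record, key):
    """Read a field; a missing key yields ABSENT rather than an error."""
    return record.get(key, ABSENT)

def plus(a, b):
    """Absent plus absent is absent; one absent operand returns the other."""
    if a is ABSENT:
        return b
    if b is ABSENT:
        return a
    return a + b

def times(a, b):
    """Same rule for multiplication: absent acts like one."""
    if a is ABSENT:
        return b
    if b is ABSENT:
        return a
    return a * b

def assign(record, key, value):
    """An absent right-hand side results in a skipped assignment."""
    if value is not ABSENT:
        record[key] = value

record = {"x": 2, "y": 3}
assign(record, "a", plus(field(record, "u"), field(record, "v")))  # skipped
assign(record, "b", plus(field(record, "u"), field(record, "y")))
assign(record, "c", plus(field(record, "x"), field(record, "y")))
print(record)  # {'x': 2, 'y': 3, 'b': 3, 'c': 5}
```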
mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the is_not_null function as follows:
mlr put 'is_not_null($x) { @sum += $x }'
Absent stream-record values should not break accumulations, since Miller by design handles heterogeneous data: the running @sum in mlr put '@sum += $x' should not be invalidated for records which have no x.
Absent out-of-stream-variable values are precisely what allow you to write mlr put '@sum += $x'. Otherwise you would have to write mlr put 'begin{@sum = 0}; @sum += $x', which is tolerable, but for mlr put 'begin{...}; @sum[$a][$b] += $x' you'd have to pre-initialize @sum for all values of $a and $b in your input data stream, which is intolerable.
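For comparison, here is roughly what a two-level keyed accumulation looks like in Python when nothing is pre-initialized. It is an analogy rather than a description of Miller internals: defaultdict plays the role that absent out-of-stream values play above, and the function name accumulate and the field names a, b, x are just the ones used in this section's example.

```python
from collections import defaultdict

def accumulate(records):
    """Sum x grouped by a then b, creating each slot on first use."""
    sums = defaultdict(lambda: defaultdict(int))
    for record in records:
        # Records missing any of the three fields are skipped, so
        # heterogeneous input does not invalidate the running sums.
        if all(key in record for key in ("a", "b", "x")):
            sums[record["a"]][record["b"]] += record["x"]
    return {a: dict(by_b) for a, by_b in sums.items()}

records = [
    {"a": "pan", "b": "wye", "x": 1},
    {"a": "pan", "b": "wye", "x": 2},
    {"a": "eks", "b": "zee", "x": 5},
    {"a": "eks", "b": "zee"},  # no x: ignored
]
print(accumulate(records))  # {'pan': {'wye': 3}, 'eks': {'zee': 5}}
```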
The penalty for the absent feature is that misspelled variables can be hard to find: e.g. in mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}' the accumulator is spelt @sumx in the begin-block but @sum in the end-block, where since it is absent, @sum*2 evaluates to 2. See also the section on errors and transparency.
Accumulations such as @sum += $x work correctly on heterogeneous data, as do within-record formulas if both operands are absent. If one operand is present, you may get behavior you don't desire. To work around this (namely, to set an output field only for records which have all the inputs present) you can use a pattern-action block with is_present:
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put 'is_present($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '$loadmillis = (is_present($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp}}HERE
If you’re interested in a formal description of how empty and absent
fields participate in arithmetic, here’s a table for plus (other
arithmetic/boolean/bitwise operators are similar):
POKI_RUN_COMMAND{{mlr --print-type-arithmetic-info}}HERE
The following backslash escapes are recognized in string literals in contexts such as mlr filter '$name =~ "..."', mlr put '$name = $othername . "..."', mlr put '$name = sub($name, "...", "...")', etc.:
\a: ASCII code 0x07 (alarm/bell)
\b: ASCII code 0x08 (backspace)
\f: ASCII code 0x0c (formfeed)
\n: ASCII code 0x0a (LF/linefeed/newline)
\r: ASCII code 0x0d (CR/carriage return)
\t: ASCII code 0x09 (tab)
\v: ASCII code 0x0b (vertical tab)
\\: backslash
\": double quote
\123: Octal 123, etc. for \000 up to \377
\x7f: Hexadecimal 7f, etc. for \x00 up to \xff
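As an illustration of the escape list above, here is a small Python decoder for the same sequences. It is a sketch under assumptions (exactly three octal digits, exactly two hex digits, unrecognized sequences left alone), and decode_escapes is a hypothetical name, not a Miller function.

```python
import re

# Single-character escapes from the list above.
SIMPLE = {"a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
          "t": "\t", "v": "\v", "\\": "\\", '"': '"'}

# \xHH, \OOO (000 through 377), or one of the single-character escapes.
ESCAPE = re.compile(r'\\(x[0-9a-fA-F]{2}|[0-3][0-7]{2}|[abfnrtv\\"])')

def decode_escapes(text):
    """Replace recognized backslash escapes with the characters they name."""
    def replace(match):
        body = match.group(1)
        if body[0] == "x":
            return chr(int(body[1:], 16))
        if body[0].isdigit():
            return chr(int(body, 8))
        return SIMPLE[body]
    return ESCAPE.sub(replace, text)

print(repr(decode_escapes(r"a\tb")))   # 'a\tb' (a real tab)
print(repr(decode_escapes(r"a\\tb")))  # 'a\\tb' (backslash followed by t)
```

Scanning left to right is what makes a doubled backslash followed by t come out as a literal backslash and a t rather than a tab.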
These replacements apply to string literals in filter and put expressions: that is, if you type \t in a string literal for a filter/put expression, it will be turned into a tab character. If you want a backslash followed by a t, then please type \\t.
However, these replacements are not done automatically within your data stream. If you wish to make these replacements, you can do, for example, for a field named field, mlr put '$field = gsub($field, "\\t", "\t")'. If you need to make such a replacement for all fields in your data, you should probably simply use the system sed command.
In mlr filter with =~ or !=~, e.g. mlr filter '$url =~ "http.*com"'
In mlr put with sub or gsub, e.g. mlr put '$url = sub($url, "http.*com", "")'
In mlr having-fields, e.g. mlr having-fields --any-matching '^sda[0-9]'
In mlr cut, e.g. mlr cut -r -f '^status$,^sda[0-9]'
In mlr rename, e.g. mlr rename -r '^(sda[0-9]).*$,dev/\1'
In mlr grep, e.g. mlr --csv grep 00188555487 myfiles*.csv
Regexes are not implicitly anchored: use ^ and/or $ explicitly.
Miller regexes are wrapped with double quotes rather than slashes.
The i after the ending double quote indicates a case-insensitive regex.
Capture groups are wrapped with (...) rather than \(...\); use \( and \) to match against parentheses.
In filter and put, if the regular expression is a string literal (the normal case), it is precompiled at process start and reused thereafter, which is efficient. If the regular expression is a more complex expression, including string concatenation using ., or a column name (in which case you can take regular expressions from input data!), then regexes are compiled on each record, which works but is less efficient. As well, in this case there is no way to specify case-insensitive matching.
Example:
POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE
POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE
Regex captures of the form \0 through \9 are supported as follows. Within sub and gsub, captures are local to each function call. For example, the first \1,\2 pair belongs to the first sub and the second \1,\2 pair belongs to the second sub:
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
Captures set by the =~ and !=~ operators persist for the remainder of a put expression. For example, here the \1,\2 are set by the =~ operator and are used by both subsequent assignment statements:
mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
Captures are not retained across separate put invocations. For example, here the \1,\2 won't be expanded from the regex capture:
mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
Captures are not available in filter for the =~ and !=~ operators. For example, there is no mechanism provided to refer to the first (..) as \1 or to the second (....) as \2 in the following filter statement:
mlr filter '$a =~ "(..)_(....)"'
Up to nine capture groups are supported: \1 through \9, while \0 is the entire match string; \15 is treated as \1 followed by an unrelated 5.
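The single-digit capture rule can be illustrated with a short Python sketch. It uses Python's re module rather than the regex library Miller uses, and expand_captures is a hypothetical helper; leaving an out-of-range reference unexpanded is an assumption made only for this example.

```python
import re

def expand_captures(template, match):
    """Replace \\0 through \\9 in template with groups from a regex match.
    Only one digit is consumed, so \\15 means group 1 followed by '5'."""
    def replace(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ref.group(0)  # assumption: leave unknown groups as-is
        return match.group(index) or ""
    return re.sub(r"\\([0-9])", replace, template)

m = re.search(r"(..)_(...)", "ab_cde")
print(expand_captures(r"\2-\1", m))  # cde-ab
print(expand_captures(r"\0", m))     # ab_cde
print(expand_captures(r"\15", m))    # ab5
```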
Input scannable as int, e.g. 123 or 0xabcd, is treated as an integer; otherwise, input scannable as float (4.56 or 8e9) is treated as float; everything else is a string.
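The int-then-float-then-string order can be sketched in Python as below. This is only a model: infer_type is a hypothetical name, and Python's own number scanners accept some spellings (for example 0o17, 1_000, or nan) that may not match what Miller accepts.

```python
def infer_type(text):
    """Try int first (decimal or 0x-prefixed hex), then float, else string."""
    try:
        return int(text, 0)  # base 0 accepts both 123 and 0xabcd
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

for text in ("123", "0xabcd", "4.56", "8e9", "hello", ""):
    value = infer_type(text)
    print(repr(text), "->", type(value).__name__, repr(value))
```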
If you want all numbers to be treated as floats, then you may use float() in your filter/put expressions (e.g. replacing $c = $a * $b with $c = float($a) * float($b)) or, more simply, use mlr filter -F and mlr put -F, which force all numeric input, whether from expression literals or field values, to float. Likewise mlr stats1 -F and mlr step -F force integerable accumulators (such as count) to be done in floating-point.
Many functions produce float output even for integer input, e.g. exp(0) = 1.0 rather than 1. The following, however, produce integer output if their inputs are integers: + - * / // % abs ceil floor max min round roundm sgn. As well, stats1 -a min, stats1 -a max, stats1 -a sum, step -a delta, and step -a rsum produce integer output if their inputs are integers.
64-bit integer in hex | 64-bit integer in decimal | Cast to double | Back to 64-bit integer |
---|---|---|---|
0x7ffffffffffff9ff | 9223372036854774271 | 9223372036854773760.000000 | 0x7ffffffffffff800 |
0x7ffffffffffffa00 | 9223372036854774272 | 9223372036854773760.000000 | 0x7ffffffffffff800 |
0x7ffffffffffffbff | 9223372036854774783 | 9223372036854774784.000000 | 0x7ffffffffffffc00 |
0x7ffffffffffffc00 | 9223372036854774784 | 9223372036854774784.000000 | 0x7ffffffffffffc00 |
0x7ffffffffffffdff | 9223372036854775295 | 9223372036854774784.000000 | 0x7ffffffffffffc00 |
0x7ffffffffffffe00 | 9223372036854775296 | 9223372036854775808.000000 | 0x8000000000000000 |
0x7ffffffffffffffe | 9223372036854775806 | 9223372036854775808.000000 | 0x8000000000000000 |
0x7fffffffffffffff | 9223372036854775807 | 9223372036854775808.000000 | 0x8000000000000000 |
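The loss of precision shown in these values can be reproduced with a few lines of Python, whose float type is an IEEE double. The roundtrip helper is a hypothetical name; note that Python integers are arbitrary-precision, so the final conversion back does not overflow the way a signed 64-bit integer would.

```python
def roundtrip(n):
    """Convert an integer to a double and back again."""
    as_double = float(n)
    return as_double, int(as_double)

for n in (0x7ffffffffffff9ff, 0x7ffffffffffffa00,
          0x7ffffffffffffbff, 0x7fffffffffffffff):
    as_double, back = roundtrip(n)
    print(f"{n:#x} {n} {as_double:f} {back:#x}")
```

Doubles carry 53 bits of mantissa, so near 2^63 adjacent representable values are 1024 apart and the low bits of the integer are rounded away.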
7/2 is 3.5.
Integer division is done with //: 7//2 is 3. This rounds toward the negative.
Remainders are non-negative.
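Python's own operators happen to follow the same conventions for a positive divisor, so they give a quick way to see what rounding toward the negative and non-negative remainders mean for negative inputs. This shows Python's behavior as an analogy; the helper name floor_div_mod is hypothetical.

```python
def floor_div_mod(a, b):
    """Return (a // b, a % b): quotient rounded toward negative infinity,
    with a remainder that is non-negative when b is positive."""
    return a // b, a % b

print(7 / 2)                 # 3.5
print(floor_div_mod(7, 2))   # (3, 1)
print(floor_div_mod(-7, 2))  # (-4, 1)
print(floor_div_mod(-7, 5))  # (-2, 3)
```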