POKI_PUT_TOC_HERE
Sample CSV data file:
POKI_RUN_COMMAND{{cat example.csv}}HERE

mlr cat is like cat ... but it can also do format conversion (here, to pretty-printed tabular format):
POKI_RUN_COMMAND{{mlr --icsv --opprint cat example.csv}}HERE

mlr head and mlr tail count records rather than lines. The CSV header is included either way:
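For example (the record count passed to -n here is an arbitrary choice):

POKI_RUN_COMMAND{{mlr --icsv --opprint head -n 4 example.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint tail -n 4 example.csv}}HERE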
Sort primarily alphabetically on one field, then secondarily numerically descending on another field:
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index example.csv}}HERE

Use cut to retain only specified fields, in input-data order:
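For instance (the choice of fields here is illustrative):

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -f flag,shape example.csv}}HERE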
Use cut -o to retain only specified fields, in your specified order:
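For instance, with the same fields as above, -o writes flag before shape (the order you named them) rather than their input order:

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -o -f flag,shape example.csv}}HERE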
Use cut -x to omit specified fields:
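For instance (again, the field names are illustrative):

POKI_RUN_COMMAND{{mlr --icsv --opprint cut -x -f flag,shape example.csv}}HERE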
Use filter to retain specified records:
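For example, keeping only the red records with the flag set (the particular predicate is illustrative):

POKI_RUN_COMMAND{{mlr --icsv --opprint filter '$color == "red" && $flag == 1' example.csv}}HERE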
Use put to add/replace fields which are computed from other fields:
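For example (the computed field here is illustrative; the same expression reappears in the JSON example below):

POKI_RUN_COMMAND{{mlr --icsv --opprint put '$ratio = $quantity / $rate' example.csv}}HERE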
Even though Miller’s main selling point is name-indexing, sometimes you really want to refer to a field by its positional index. Use $[[3]] to access the name of field 3, or $[[[3]]] to access the value of field 3:
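For example (the new field names here are arbitrary):

POKI_RUN_COMMAND{{mlr --icsv --opprint put '$name3 = $[[3]]; $value3 = $[[[3]]]' example.csv}}HERE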
JSON output:
POKI_RUN_COMMAND{{mlr --icsv --ojson put '$ratio = $quantity/$rate; $shape = toupper($shape)' example.csv}}HERE

JSON output with vertical-formatting flags:
POKI_RUN_COMMAND{{mlr --icsv --ojson --jvstack --jlistwrap tail -n 2 example.csv}}HERE

Use then to pipe commands together. Also, the -g option for many Miller commands is for group-by: here, head -n 1 -g shape outputs the first record for each distinct value of the shape field. This means we’re finding the record with the highest index field for each distinct shape field:
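One way to express that is to sort descending by index, then take the first record per shape (a representative pipeline):

POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index then head -n 1 -g shape example.csv}}HERE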
Statistics can be computed with or without group-by field(s). Also, the first of these two examples uses the --oxtab output format, which is a nice alternative to --opprint when you have lots of columns:
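For example (the choice of accumulators passed to -a is illustrative):

POKI_RUN_COMMAND{{mlr --icsv --oxtab stats1 -a mean,min,max -f quantity,rate example.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint stats1 -a mean,min,max -f quantity,rate -g shape example.csv}}HERE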
Often we want to print output to the screen. Miller does this by default, as we’ve seen in the previous examples.
Sometimes we want to print output to another file: just use '> outputfilenamegoeshere' at the end of your command:
% mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
% cat newfile.csv
color  shape    flag index quantity rate
yellow triangle 1    11    43.6498  9.8870
red    square   1    15    79.2778  0.0130
red    circle   1    16    13.8103  2.9010
red    square   0    48    77.5542  7.4670
purple triangle 0    51    81.2290  8.5910
red    square   0    64    77.1991  9.5310
purple triangle 0    65    80.1405  5.8240
yellow circle   1    73    63.9785  4.2370
yellow circle   1    87    63.5058  8.3350
purple square   0    91    72.3735  8.2430
You can also modify a file in place with mlr -I. For example, first make a copy of the sample data:

% cp example.csv newfile.txt
% cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,1,11,43.6498,9.8870
red,square,1,15,79.2778,0.0130
red,circle,1,16,13.8103,2.9010
red,square,0,48,77.5542,7.4670
purple,triangle,0,51,81.2290,8.5910
red,square,0,64,77.1991,9.5310
purple,triangle,0,65,80.1405,5.8240
yellow,circle,1,73,63.9785,4.2370
yellow,circle,1,87,63.5058,8.3350
purple,square,0,91,72.3735,8.2430
% mlr -I --icsv --opprint cat newfile.txt
% cat newfile.txt
color  shape    flag index quantity rate
yellow triangle 1    11    43.6498  9.8870
red    square   1    15    79.2778  0.0130
red    circle   1    16    13.8103  2.9010
red    square   0    48    77.5542  7.4670
purple triangle 0    51    81.2290  8.5910
red    square   0    64    77.1991  9.5310
purple triangle 0    65    80.1405  5.8240
yellow circle   1    73    63.9785  4.2370
yellow circle   1    87    63.5058  8.3350
purple square   0    91    72.3735  8.2430
Likewise, with mlr -I you can bulk-operate on lots of files, e.g.

mlr -I --csv cut -x -f unwanted_column_name *.csv
Lastly, using tee within put, you can split your input data into separate files per one or more field names:
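A sketch of such a split (put -q suppresses the usual record output; the output filenames are computed from the shape field):

POKI_RUN_COMMAND{{mlr --icsv --ocsv put -q 'tee > $shape . ".csv", $*' example.csv}}HERE

The resulting per-shape files look like this: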
POKI_RUN_COMMAND{{cat circle.csv}}HERE
POKI_RUN_COMMAND{{cat square.csv}}HERE
POKI_RUN_COMMAND{{cat triangle.csv}}HERE
Consider this small CSV dataset:

shape,flag,index
circle,1,24
square,0,36

In Miller’s default output format, the same records look like this:

shape=circle,flag=1,index=24
shape=square,flag=0,index=36

Data written this way are called DKVP, for delimited key-value pairs.
We’ve also already seen other ways to write the same data:

CSV:
shape,flag,index
circle,1,24
square,0,36

PPRINT:
shape  flag index
circle 1    24
square 0    36

JSON:
[
{
  "shape": "circle",
  "flag": 1,
  "index": 24
},
{
  "shape": "square",
  "flag": 0,
  "index": 36
}
]

DKVP:
shape=circle,flag=1,index=24
shape=square,flag=0,index=36

XTAB:
shape circle
flag  1
index 24

shape square
flag  0
index 36
Anything we can do with CSV input data, we can do with input data in any other format. And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.

Miller also plays well with SQL databases. Query output produced with tab-delimited fields is TSV, which Miller reads with mlr --tsv or mlr --tsvlite. This means I can do some (or all, or none) of my data processing within SQL queries, and some (or none, or all) of my data processing using Miller — whichever is most convenient for my needs at the moment.
For example, using default output formatting in mysql we get formatting like Miller’s --opprint --barred:
$ mysql --database=mydb -e 'show columns in mytable'
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| id               | bigint(20)   | NO   | MUL | NULL    |       |
| category         | varchar(256) | NO   |     | NULL    |       |
| is_permanent     | tinyint(1)   | NO   |     | NULL    |       |
| assigned_to      | bigint(20)   | YES  |     | NULL    |       |
| last_update_time | int(11)      | YES  |     | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
Using mysql’s -B option, we get TSV output:
$ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --opprint cat
Field            Type         Null Key Default Extra
id               bigint(20)   NO   MUL NULL    -
category         varchar(256) NO   -   NULL    -
is_permanent     tinyint(1)   NO   -   NULL    -
assigned_to      bigint(20)   YES  -   NULL    -
last_update_time int(11)      YES  -   NULL    -
Or, with JSON output:

$ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --ojson --jlistwrap --jvstack cat
[
{
  "Field": "id",
  "Type": "bigint(20)",
  "Null": "NO",
  "Key": "MUL",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "category",
  "Type": "varchar(256)",
  "Null": "NO",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "is_permanent",
  "Type": "tinyint(1)",
  "Null": "NO",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "assigned_to",
  "Type": "bigint(20)",
  "Null": "YES",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "last_update_time",
  "Type": "int(11)",
  "Null": "YES",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
}
]
Or, we can send the query output to a file and post-process it with Miller:

$ mysql --database=mydb -B -e 'select * from mytable' > query.tsv

$ mlr --from query.tsv --t2p stats1 -a count -f id -g category,assigned_to
category assigned_to id_count
special  10000978    207
special  10003924    385
special  10009872    168
standard 10000978    524
standard 10003924    392
standard 10009872    108
...
Going the other way, we can use Miller to format data for loading into a SQL table:

mysql> CREATE TABLE abixy(
  a VARCHAR(32),
  b VARCHAR(32),
  i BIGINT(10),
  x DOUBLE,
  y DOUBLE
);
Query OK, 0 rows affected (0.01 sec)

bash$ mlr --onidx --fs comma cat data/medium > medium.nidx

mysql> LOAD DATA LOCAL INFILE 'medium.nidx' REPLACE INTO TABLE abixy FIELDS TERMINATED BY ',' ;
Query OK, 10000 rows affected (0.07 sec)
Records: 10000  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(*) AS count FROM abixy;
+-------+
| count |
+-------+
| 10000 |
+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM abixy LIMIT 10;
+------+------+------+---------------------+---------------------+
| a    | b    | i    | x                   | y                   |
+------+------+------+---------------------+---------------------+
| pan  | pan  |    1 |  0.3467901443380824 |  0.7268028627434533 |
| eks  | pan  |    2 |  0.7586799647899636 |  0.5221511083334797 |
| wye  | wye  |    3 | 0.20460330576630303 | 0.33831852551664776 |
| eks  | wye  |    4 | 0.38139939387114097 | 0.13418874328430463 |
| wye  | pan  |    5 |  0.5732889198020006 |  0.8636244699032729 |
| zee  | pan  |    6 |  0.5271261600918548 | 0.49322128674835697 |
| eks  | zee  |    7 |  0.6117840605678454 |  0.1878849191181694 |
| zee  | wye  |    8 |  0.5985540091064224 |   0.976181385699006 |
| hat  | wye  |    9 | 0.03144187646093577 |  0.7495507603507059 |
| pan  | wye  |   10 |  0.5026260055412137 |  0.9526183602969864 |
+------+------+------+---------------------+---------------------+
mysql> SELECT a, b, COUNT(*) AS count FROM abixy GROUP BY a, b ORDER BY COUNT DESC;
+------+------+-------+
| a    | b    | count |
+------+------+-------+
| zee  | wye  |   455 |
| pan  | eks  |   429 |
| pan  | pan  |   427 |
| wye  | hat  |   426 |
| hat  | wye  |   423 |
| pan  | hat  |   417 |
| eks  | hat  |   417 |
| pan  | zee  |   413 |
| eks  | eks  |   413 |
| zee  | hat  |   409 |
| eks  | wye  |   407 |
| zee  | zee  |   403 |
| pan  | wye  |   395 |
| wye  | pan  |   392 |
| zee  | eks  |   391 |
| zee  | pan  |   389 |
| hat  | eks  |   389 |
| wye  | eks  |   386 |
| wye  | zee  |   385 |
| hat  | zee  |   385 |
| hat  | hat  |   381 |
| wye  | wye  |   377 |
| eks  | pan  |   371 |
| hat  | pan  |   363 |
| eks  | zee  |   357 |
+------+------+-------+
25 rows in set (0.01 sec)
We can do the same aggregation in Miller, directly on the data file:

$ mlr --opprint uniq -c -g a,b then sort -nr count data/medium
a   b   count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
hat pan 363
eks zee 357
or by reading the data back out of mysql:

$ mysql -D miller -B -e 'select * from abixy' | mlr --itsv --opprint uniq -c -g a,b then sort -nr count
a   b   count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
wye pan 392
zee eks 391
zee pan 389
hat eks 389
wye eks 386
hat zee 385
wye zee 385
hat hat 381
wye wye 377
eks pan 371
hat pan 363
eks zee 357
If your program logs items in name=value format, its output is easy to process with the system grep or what have you. Also, not every line needs to have the same list of field names (“schema”).
Again, all the examples in the CSV section apply here — just change
the input-format flags. But there’s more you can do when not all the
records have the same shape.
When you write a program — in any language whatsoever — you can have it print log lines as it goes along, with items for various events jumbled together. After the program has finished running, you can sort it all out, filter it, analyze it, and learn from it.
Suppose your program has printed something like this:
POKI_RUN_COMMAND{{cat log.txt}}HERE
Each print statement simply contains local information: the current timestamp, whether a particular cache was hit or not, etc. Then, using either the system grep command, or Miller’s having-fields, or is_present, we can pick out the parts we want and analyze them:
POKI_INCLUDE_AND_RUN_ESCAPED(10-1.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(10-2.sh)HERE
Alternatively, we can simply group the similar data for a better look:
POKI_RUN_COMMAND{{mlr --opprint group-like log.txt}}HERE
POKI_RUN_COMMAND{{mlr --opprint group-like then sec2gmt time log.txt}}HERE