I'm working with a text file that has one JSON object per line, and I want to use jq to select fields from it and then group_by(.key1) and sort_by(.key1) the records. The file looks like this:
# /tmp/sample.json
{"key1": "value11", "key2": "value21", "key3": "value31"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value13", "key2": "value23", "key3": "value33"}
{"key1": "value13", "key2": "value24", "key3": "value34"}
{"key1": "value16", "key2": "value26", "key3": "value36"}
{"key1": "value17", "key2": "value27", "key3": "value37"}
...
I'm running the file through Hadoop MapReduce in a similar manner to this question:
hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-files $HOME/bin/jq,$HOME/proj-map.jq,$HOME/proj-reduce.jq \
-mapper "./jq -c --from-file=proj-map.jq" \
-reducer "./jq -ncr --from-file=proj-reduce.jq" \
-input /tmp/sample.json \
-output /tmp/sample.json.output
with
#proj-map.jq
# some transformation
{key1, key2}
and
#proj-reduce.jq
# by @peak -- https://stackoverflow.com/a/45715729/948914
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
GROUPS_BY(inputs|.key1; .) | {key1: .[0], size: length} | (.size|tostring) + "\t" + tostring
The above yields something that I can feed into Unix sort for sorting:
3 {"key1": "value11", "size": 3}
2 {"key1": "value13", "size": 2}
1 {"key1": "value16", "size": 1}
1 {"key1": "value17", "size": 1}
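For concreteness, the sort step looks roughly like the sketch below. The function name sort_by_size is hypothetical, and I'm assuming the reducer output is tab-separated with the size in the first field. Note that -s (stable sort) is available in GNU and BSD sort but is not required by POSIX.

```shell
# Hypothetical helper: order reducer output lines by the numeric size in field 1.
# Assumes each line is "<size><TAB><json>", as emitted by proj-reduce.jq.
sort_by_size() {
  # -t: use a tab as the field separator
  # -k1,1nr: sort on field 1 only, numerically, largest first
  # -s: stable, so lines with equal sizes keep their input order
  sort -s -t "$(printf '\t')" -k1,1nr
}
```

It would be used along the lines of hadoop fs -cat /tmp/sample.json.output/part-* | sort_by_size, assuming Hadoop's default part-* output file names.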
This works. Now, I don't want to rely on Unix sort, and I'm looking for a way to use jq's sort_by() instead. I figured this could be challenging because, from what I understand, sort_by() requires an array as input, which implies that the whole array is loaded into memory. Since the file may not fit in memory, I'm looking for a way to use jq's sort_by() without reading the entire file into memory. In particular, I'm interested in an efficient, streaming-type way of sorting, similar to Unix sort or to the streaming group_by().
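One partial workaround I can think of is sketched below, under the assumption that the set of distinct key1 values (rather than the whole file) fits in memory: keep only a counter per key while streaming, then call sort_by on that small aggregate. The function name count_and_sort is hypothetical. This doesn't make sort_by streaming; it only shrinks what sort_by has to hold, so it doesn't answer the general question.

```shell
# Hypothetical helper: count records per key1 while streaming, then sort
# only the resulting aggregate. Reads one JSON object per line on stdin.
# Assumes .key1 is always a string.
count_and_sort() {
  jq -ncr '
    # Stream the inputs, keeping one counter per distinct key1.
    reduce (inputs | .key1) as $k ({}; .[$k] += 1)
    # Turn the counters into an array of {key, value} entries.
    | to_entries
    # Sort the (small) aggregate by count, largest first.
    | sort_by(-.value)
    | .[]
    | {key1: .key, size: .value}
    # Emit "<size><TAB><json>", the same shape as proj-reduce.jq.
    | (.size | tostring) + "\t" + tostring
  '
}
```

Memory use here grows with the number of distinct keys, not with the number of input lines, which is why the sort_by call at the end is affordable in this particular case.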
If there is no such way, then, to the best of my knowledge, the best option is this answer, which combines jq and Unix sort, as I showed above. Obviously it would be great if sort_by worked like Unix sort, but I don't have the means to find out.