
I'm working with a text file that has one JSON object per line, and I want to use jq to select fields, group_by(.key1), and sort_by(.key1) across the file. The file looks like this:

# /tmp/sample.json
{"key1": "value11", "key2": "value21", "key3": "value31"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value11", "key2": "value22", "key3": "value32"}
{"key1": "value13", "key2": "value23", "key3": "value33"}
{"key1": "value13", "key2": "value24", "key3": "value34"}
{"key1": "value16", "key2": "value26", "key3": "value36"}
{"key1": "value17", "key2": "value27", "key3": "value37"}
...

I'm running the file through Hadoop MapReduce in a similar manner to this question:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -files $HOME/bin/jq,$HOME/proj-map.jq,$HOME/proj-reduce.jq \
    -mapper "./jq -c --from-file=proj-map.jq" \
    -reducer "./jq -ncr --from-file=proj-reduce.jq" \
    -input  /tmp/sample.json \
    -output /tmp/sample.json.output

with

#proj-map.jq
# some transformation
{key1, key2}

and

#proj-reduce.jq
# by @peak -- https://stackoverflow.com/a/45715729/948914
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;

GROUPS_BY(inputs|.key1; .) | {key1: .[0], size: length} | (.size|tostring) + "\t" + tostring

The above yields something that I can feed into Unix sort for sorting:

3 {"key1": "value11", "size": 3}
2 {"key1": "value13", "size": 2}
1 {"key1": "value16", "size": 1}
1 {"key1": "value17", "size": 1}

This works. Now I'd like to drop the dependency on Unix sort and use jq's sort_by() instead. The difficulty, as I understand it, is that sort_by() requires an array as input, which means the entire array must be loaded into memory. Since the file may not fit in memory, I'm looking for a way to use jq's sort_by() without reading the whole file into memory. In particular, I'm interested in an efficient, streaming-style way of sorting, similar to Unix sort, or to the streaming group_by() above.
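
To illustrate the constraint, the straightforward approach slurps the entire file into a single array before sorting, which is exactly what I can't afford:

# -s reads the whole file into memory as one big array before sort_by runs
jq -sc 'sort_by(.key1)[]' /tmp/sample.json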

If there is no such way, then to the best of my knowledge the answer is the approach above, which combines jq and Unix sort. Obviously it would be great if sort_by() worked like Unix sort, but I don't have the means to find out.

NucFlash
    Streaming sorting is not possible. If your first element needs to come last, you have to store it until you can output it; there's no way around it. `sort` is cheating by using disk on large data; `jq` can't. – Amadan Feb 19 '20 at 03:16

1 Answer


[The following was written before the question was updated to explain that the input consists of multiple JSON entities.]

To simplify things a bit, the following assumes that you have a huge file consisting of a single JSON array. Since, by assumption, this file is too big to read into memory, the first step will be to get each of the top-level array elements onto a line by itself. That can be done using jq's --stream command-line option, as described in the jq FAQ, perhaps along the lines of:

jq -cn --stream 'fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) )'
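
For example, here is a small self-contained illustration of that idiom (not part of the pipeline itself):

echo '[{"key1":"a"},{"key1":"b"}]' |
  jq -cn --stream 'fromstream( inputs|(.[0] |= .[1:]) | select(. != [[]]) )'
# => {"key1":"a"}
# => {"key1":"b"}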

The next step is to prefix each of these lines with the "sort by" value, as described in the link included in the Q. (That is, jq can easily be used.)
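
For example, assuming the sort key is .key1 and that its values never contain a tab, a one-liner in the same spirit as the tab trick in proj-reduce.jq above would be:

jq -cr '.key1 + "\t" + tostring'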

Next, run the operating system sort.
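
For example, with GNU sort (the file names here are placeholders):

# -S caps sort's in-memory buffer; beyond that, GNU sort spills sorted
# runs to temporary files on disk and merges them, so the input need not fit in RAM
LC_ALL=C sort -t $'\t' -k1,1 -S 1G prefixed.jsonl > sorted.jsonl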

Finally, if you really need the result as a single large array, you could use a text-processing tool (e.g. awk).
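
For example, a sketch with awk (again with placeholder file names), after first stripping the tab-delimited prefix with cut:

# wraps the lines in [ ... ], inserting commas between them;
# emits [] if the input is empty
cut -f2- sorted.jsonl | awk 'NR==1 {printf "[%s", $0; next} {printf ",%s", $0} END {print (NR ? "]" : "[]")}'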

peak
  • Thank you, @peak. You're tackling a more challenging problem than the one I have. Appreciated it. My use case is slightly easier because the file contains one JSON object per line. I updated my question to reflect this. Thanks again. – NucFlash Feb 22 '20 at 21:04