
I've got a tool that outputs a JSON record on each line, and I'd like to process it with jq.

The output looks something like this:

{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}
{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}

When I pass this to jq as follows:

./tool | jq 'group_by(.id)'

...it outputs an error:

jq: error (at <stdin>:1): Cannot index string with string "id"

How do I get jq to handle JSON-record-per-line data?

peak
Roger Lipscombe
    There's an almost-duplicate here: https://stackoverflow.com/q/34477547/8446, but it asks two questions. This asks just one question. – Roger Lipscombe Aug 16 '17 at 13:16

2 Answers


Use the --slurp (or -s) switch:

./tool | jq --slurp 'group_by(.id)'

It outputs the following:

[
  [
    {
      "ts": "2017-08-15T21:20:47.029Z",
      "id": "123",
      "elapsed_ms": 10
    }
  ],
  [
    {
      "ts": "2017-08-15T21:20:47.044Z",
      "id": "456",
      "elapsed_ms": 13
    }
  ]
]

...which you can then process further. For example:

./tool | jq -s 'group_by(.id) | map({id: .[0].id, count: length})'
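With the two sample records from the question, that pipeline yields one object per distinct id (each count is 1 here, since each id occurs exactly once):

```shell
# Feed the question's two sample records to the slurped pipeline.
printf '%s\n' \
  '{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}' \
  '{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}' |
jq -c -s 'group_by(.id) | map({id: .[0].id, count: length})'
# [{"id":"123","count":1},{"id":"456","count":1}]
```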
Roger Lipscombe
    It should be pointed out that filters like `group_by/1` takes an array as input. The original input was a stream of objects so they had to be collected into an array first (i.e., slurped). – Jeff Mercado Aug 16 '17 at 16:31
  • --slurp is fine, but if the file is really huge, it won't fit in memory. I know about --stream, but that one is so exotic that anything that was trivial to do becomes a challenge, so it's easier to just invoke jq per line and parallelize it. Very wasteful. Is there a jq way to process per line without --stream? – Martin Mucha Aug 19 '22 at 08:18
  • @MartinMucha the other answer uses streaming. I also wrote a blog post about it: https://blog.differentpla.net/blog/2019/01/11/jq-reduce/ – Roger Lipscombe Aug 19 '22 at 09:04
  • @RogerLipscombe and _that_ will _not_ cause the initial file to be loaded into memory whole? I thought otherwise ... – Martin Mucha Aug 19 '22 at 10:24

As @JeffMercado pointed out, jq handles streams of JSON just fine, but if you use group_by, then you'd have to ensure its input is an array. That could be done in this case using the -s command-line option; if your jq has the inputs filter, then it can also be done using that filter in conjunction with the -n option.
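For instance, collecting the stream into an array with `[inputs]` under -n is equivalent to slurping (a sketch using the question's sample records):

```shell
printf '%s\n' \
  '{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}' \
  '{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}' |
jq -c -n '[inputs] | group_by(.id)'
# same result as: jq -c -s 'group_by(.id)'
```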

If your jq has inputs (available since jq 1.5), however, a better approach is to use the following streaming variant of group_by:

 # sort-free stream-oriented variant of group_by/1
 # f should always evaluate to a string.
 # Output: a stream of arrays, one array per group
 def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;

Usage example: GROUPS_BY(inputs; .id)

Note that you will want to use this with the -n command line option.
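Putting the pieces together over the question's sample records, an invocation might look like this; note the -n, and that each group arrives as its own output line:

```shell
printf '%s\n' \
  '{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}' \
  '{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}' |
jq -c -n '
  # sort-free stream-oriented variant of group_by/1
  def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[];
  GROUPS_BY(inputs; .id)'
# [{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}]
# [{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}]
```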

Such a streaming variant has two main advantages:

  1. it generally requires less memory in that it does not require a copy of the entire input stream to be kept in memory while it is being processed;
  2. it is potentially faster because it does not require any sort operation, unlike group_by/1.

Please note that the above definition of GROUPS_BY/2 follows the convention for such streaming filters in that it produces a stream. Other variants are of course possible.

Handling a large amount of data

The following illustrates how to economize on memory. Suppose the task is to produce a frequency count of .id values. The humdrum solution would be:

GROUPS_BY(inputs; .id) | [(.[0]|.id), length]

A more economical and indeed far better solution would be:

GROUPS_BY(inputs|.id; .) | [.[0], length]
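As a runnable sketch of the economical version, again over the question's sample records:

```shell
printf '%s\n' \
  '{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}' \
  '{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}' |
jq -c -n '
  def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[];
  GROUPS_BY(inputs|.id; .) | [.[0], length]'
# ["123",1]
# ["456",1]
```

Here only the .id values are accumulated, so memory use grows with the number of distinct ids rather than with the size of the full records.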
peak
    This version of group_by() is pretty nice. It may be worth adding to [builtin.jq](https://github.com/stedolan/jq/blob/master/src/builtin.jq) – jq170727 Aug 16 '17 at 18:03
  • @jq170727 - I've submitted a PR but it probably won't go anywhere for a while. In the meantime, Windows users can get the updated jq.exe from https://ci.appveyor.com/project/stedolan/jq/build/1.0.212 – peak Aug 16 '17 at 22:00