
I have this input file (simplified for learning):

{"type":"a","id":"1"}
{"type":"a","id":"2"}
{"type":"b","id":"1"}
{"type":"c","id":"3"}

I'd like to turn it into:

{
    "a": [1,2],
    "b": [1],
    "c": [3]
}

using the --stream option. It's not strictly needed here, this is just for learning. Or at least it does not seem viable to use group_by or reduce without it on bigger files (even a few GB seems to be rather slow).
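For reference, the kind of non-streamed variant I mean is roughly the following sketch (group_by-based, not tuned for speed):

jq -n '[inputs]                                   # slurp all objects into one array
       | group_by(.type)                          # sort and group them by .type
       | map({(.[0].type): map(.id | tonumber)})  # one {type: [ids]} object per group
       | add' test3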

I understand that, with --stream, I can write something like:

jq --stream -cn 'reduce (inputs|select(length==2)) as $i([]; . + ..... )' test3

but that would just process the data per line (per processed item in the stream), i.e. I can see either the type or the id, and there is no place to create the pairing. I can cram everything into one big array, but that's the opposite of what I have to do.
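For reference, this is what --stream emits for the first input line; the final length-1 event only marks the end of the object:

jq -c --stream . <<< '{"type":"a","id":"1"}'
[["type"],"a"]
[["id"],"1"]
[["id"]]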

How do I create such pairings? I don't even know how to create (using --stream):

{"a":1}
{"a":2}
...

I know both (the first target transformation, and the one just above this paragraph) are probably some trivial usage of foreach. I have a working example of one here, but all its .accumulator and .complete keywords (IIUC) are now just magic to me. I understood it once, but ... Sorry for the trivial questions.
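My closest guess for the simpler per-object output is the sketch below, but I'm not sure it is the right way to use foreach, and it only works because the input is this flat:

jq --stream -cn '
  foreach inputs as $e ({};
    if ($e | length) == 2
    then .[$e[0][-1]] = $e[1]     # [path, value] event: remember this key/value
    else .                        # [path] event: the object is complete, keep state
    end;
    if ($e | length) == 1         # emit only once an object has been completed
    then {(.type): (.id | tonumber)}
    else empty
    end
  )' test3

which (for the sample input) should give:

{"a":1}
{"a":2}
{"b":1}
{"c":3}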

UPDATE regarding performance:

@pmf provided two solutions in his answer: a streaming and a non-streaming one. Thanks for that; I was able to write the non-streaming version myself, but not the streaming one. When testing it, the streaming variant was (I'm not 100% sure now, but ...) 2-4 times slower. That makes sense if the data does not fit into memory, but luckily in my case it does. So I ran the non-streaming version on a ~1 GB file on a laptop with a not-actually-that-slow i7-9850H CPU @ 2.60GHz. To my surprise it wasn't done within 16 hours, so I killed it as not a viable solution for my use case of potentially much bigger input files.

Considering the simplicity of the input, I decided to write a pipeline using just bash, grep, sed, paste and tr, and even though it used some regexes, was overall inefficient as hell, and had no parallelism whatsoever, the whole file was correctly crunched in 55 seconds. I understand that character manipulation is faster than parsing JSON, but that much of a difference? Isn't there some better approach while still parsing JSON? I don't mind spending more CPU power, but if I'm using jq, I'd like to use its functions and process JSON as JSON, not just as characters like I did with bash.
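For illustration only, the character-level approach amounts to something like this sketch (not the exact pipeline; it leans on sed and awk rather than the tools listed above, and the key order of the output object is not guaranteed):

sed -E 's/.*"type":"([^"]*)".*"id":"([^"]*)".*/\1 \2/' test3 |
awk '{ ids[$1] = ($1 in ids) ? ids[$1] "," $2 : $2 }      # collect ids per type
     END { printf "{"; sep = ""
           for (t in ids) { printf "%s\"%s\":[%s]", sep, t, ids[t]; sep = "," }
           print "}" }'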

Martin Mucha

1 Answer


In the "unstreamed" case I`d use

jq -n 'reduce inputs as $i ({}; .[$i.type] += [$i.id | tonumber])'


With the --stream option set, just re-create the streamed items using fromstream:

jq --stream -n 'reduce fromstream(inputs) as $i ({}; .[$i.type] += [$i.id | tonumber])'
{
  "a": [1,2],
  "b": [1],
  "c": [3]
}
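For reference, fromstream(inputs) on its own just reassembles the original objects from the [path, value] / [path] events, emitting each one as soon as its closing (length-1) event arrives:

jq --stream -cn 'fromstream(inputs)' test3
{"type":"a","id":"1"}
{"type":"a","id":"2"}
{"type":"b","id":"1"}
{"type":"c","id":"3"}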
pmf
  • uff, ok. The documentation is close to none regarding fromstream. Can you explain what it does and how it works, so I know how it emits stuff and how much data will be present in memory at the same time? Because, IIUC, the two presented solutions are effectively the same -- they load the whole JSON into RAM and then reduce it, no? My point in not using the first one (which I know how to write) is that it's WAY too slow, and I'd like to speed it up somehow. No idea how to do it, nor whether the streaming variant you show, or maybe some other one which does not keep the whole JSON in RAM, will help meaningfully – Martin Mucha Dec 03 '22 at 14:24
  • and just from naive testing, the first approach is much faster than the one with --stream and fromstream... – Martin Mucha Dec 03 '22 at 14:37
  • @MartinMucha Both keep only one object in memory (as opposed to using the `--slurp` option). The second is slower because it additionally [breaks down](https://stackoverflow.com/questions/74665921/effectively-accessing-first-item-in-object/74666066?noredirect=1#comment131789961_74666066) the input and puts it back together. `fromstream` accumulates those broken down stream items, and once an object is complete (backtracking on top level), it will emit it. See the [docs](https://github.com/stedolan/jq/blob/master/src/builtin.jq#L227-L234). – pmf Dec 03 '22 at 14:42
  • please see the updated question, the remark about performance. Is the reduce in question expected to perform like this? – Martin Mucha Dec 04 '22 at 13:04
  • @MartinMucha No, it is not. A simple filter (like the two above) should run through in reasonable time. This is because your input is already a stream of objects, so they can be worked off independently of each other, regardless of in what format you do it (as is, or reformatted using `--stream`, which in comparison is obviously a bit slower). So, without creating a superordinate structure (as with `--slurp`), without nested iterations (as with `.[]`), and with an input consisting of many small objects (not a few gigantic ones), I don't see a reason why this should be orders of magnitude slower. – pmf Dec 04 '22 at 17:38
  • ok, so if you happen to remember my other question from yesterday, the actual input is an object whose fields are strings; typically there is just one such field. The value of that field would be the array from this question. So we established that the reduce from this question is OK. If the actual template is: `time jq 'keys[0] as $k|.[$k]| reduce .[] as $item({}; .[$item.type]|=(.+[$item.id]))' dump.json > out`, could the small differences, like unwrapping one outer structure and using |= . + … instead of += …, be responsible for this performance degradation? – Martin Mucha Dec 04 '22 at 18:54
  • @MartinMucha It's hard for me to follow, but to my latest understanding your actual input is in fact not a stream, then. In this case you might want to `--stream` it, and cut it down to the keys of interest using `truncate_stream`, maybe filtered using `select`. I suggest opening a new question with all relevant parts included. Also note that `keys` performs a sort while `keys_unsorted` doesn't (this may or may not be a bottleneck). – pmf Dec 05 '22 at 01:22
  • sure, sorry for the confusion, I'll open a new question with a proper reproducible sample – Martin Mucha Dec 05 '22 at 08:27