
I have been using jq to extract one JSON blob at a time from some relatively large files, writing the results to a file of one JSON object per line for further processing. Here is an example of the JSON format:

{
  "date": "2023-07-30",
  "results1":[
    {
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2":[
    {
      "data": [    
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}

My current approach is to run the following and redirect stdout to a file:

jq -rc ".results1[]" my_json.json

This works fine; however, it seems that jq reads the entire file into memory in order to extract the chunk I am interested in.
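In full, that means one jq pass per key, i.e. reading the same file twice (output file names here are just for illustration):

jq -rc ".results1[]" my_json.json > results1.jsonl
jq -rc ".results2[]" my_json.json > results2.jsonl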

Questions:

  1. Does jq read the entire file into memory when I execute the above statement?
  2. Assuming the answer is yes, is there a way that I can extract results1[] and results2[] in the same call to avoid reading the file twice?

I have tried the --stream option, but it is very slow. I have also read that it sacrifices speed for memory savings; memory is not an issue at this time, so I would prefer to avoid that option. Basically, I need to read the above JSON once and output two files in JSON Lines format.

Edit: (I changed the input data a bit to show the differences in the output)

Output file 1:

{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}

Output file 2:

{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

It seems pretty well known that the streaming option is slow. See the discussion here.

My attempt at implementing it followed the answer here.
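For reference, that style of extraction looks roughly like the following (a sketch of the fromstream/truncate_stream idiom, applied to results1 only; requires jq 1.5 or later). It produces the same lines as output file 1, just much more slowly:

jq -cn --stream '
  fromstream(2 | truncate_stream(inputs | select(.[0][0] == "results1")))
' my_json.json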

fsumathguy
  • `I have used the --stream option but it is very slow.` I doubt that. Could you show how you implemented that with stream? – Inian Jul 31 '23 at 12:47
  • Also please post the _exact_ desired output and not leave it to speculation – Inian Jul 31 '23 at 12:47

2 Answers


jq doesn't have any file I/O facilities, so you can't output multiple files from a single invocation.

You can output each piece of data alongside its key and post-process the result:

jq -r '
    to_entries[]                 # iterate over the top-level key/value pairs
    | select(.key != "date")    # skip the scalar "date" field
    | .key as $k                # remember which results array we are in
    | .value[]                  # emit each element of that array
    | [$k, @json]               # pair the key with the compact JSON text
    | @tsv                      # tab-separate for easy post-processing
' my_json.json

outputs

results1    {"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
results1    {"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
results2    {"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
results2    {"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

So:

while IFS=$'\t' read -r key json; do
    printf '%s\n' "$json" >> "${key}.jsonl"
done < <(
    jq -r '...' my_json.json    # the same filter as above
)

or

jq -r '...' my_json.json | awk -F '\t' '{print $2 > ($1 ".jsonl")}'
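Two practical notes: the shell loop appends with >>, so truncate or delete any existing .jsonl files before re-running it. The awk variant does not have that problem: awk's > redirection truncates each output file the first time it is used and appends to the still-open file afterwards, so every run starts clean. With a very large number of distinct keys, though, awk would need close() calls to stay under the open-file limit.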
glenn jackman

With Bash ≥ 4, you can speed this up by processing bigger chunks: have jq also emit each array's length, then read that many lines at once with mapfile:

jq -cr '$ARGS.positional[] as $key | .[$key] | $key, length, .[]' input.json \
  --args results1 results2 |
while read -r key && read -r len; do  # read the key name, then the element count
  mapfile -t -n "$len"                # slurp that many JSON lines in one go
  printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
done
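For the sample input, the jq command emits each key's name, its element count, and then that many JSON lines, which is exactly what the read/mapfile pairing consumes:

results1
2
{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
results2
2
{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}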
pmf