
I get a very large JSON stream (several GB) from curl and try to process it with jq.

The rows I want to extract with jq are wrapped in a document that describes the result structure:

{
  "results":[
    {
      "columns": ["n"],

      // get this
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      //  ... millions of rows      

      ]
    }
  ],
  "errors": []
}

I want to extract the row data with jq. This is simple:

curl XYZ | jq -r -c '.results[0].data[].row[]'

Result:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}

However, this always waits until curl has completed.

I played with the --stream option, which is made for dealing with exactly this situation. I tried the following command, but it also waits until the full object is returned from curl:

curl XYZ | jq -n --stream 'fromstream(1|truncate_stream(inputs)) | .[].data[].row[]'

Is there a way to 'jump' to the data field and start parsing the rows one by one, without waiting for the document's closing brackets?

Martin Preusse

3 Answers


To get:

{"key1": "row1", "key2": "row1"}
{"key1": "row2", "key2": "row2"}

From:

{
  "results":[
    {
      "columns": ["n"],
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
        {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
      ]
    }
  ],
  "errors": []
}

Do the following, which is equivalent to jq -c '.results[].data[].row[]', but using streaming:

jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row") | del(.[0][0:5])))'

What this does is:

  • Turn the JSON into a stream (with --stream)
  • Select the path .results[].data[].row[] (with select(.[0][0] == "results" and .[0][2] == "data" and .[0][4] == "row")); the raw stream events these path indices refer to are shown after this list
  • Discard those initial parts of the path, like "results",0,"data",0,"row" (with del(.[0][0:5]))
  • And finally turn the resulting jq stream back into the expected JSON with the fromstream(1|truncate_stream(…)) pattern from the jq FAQ
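
To see what the select is matching against, here is a minimal sketch that dumps the raw stream events for a pared-down version of the input (one column, one row, one key; this sketch is illustrative, not part of the original answer). Each event is [path, scalar] for a leaf, or [path] when the container holding that path's last element closes:

echo '{"results":[{"columns":["n"],"data":[{"row":[{"key1":"row1"}]}]}]}' |
  jq -cn --stream 'inputs'

[["results",0,"columns",0],"n"]
[["results",0,"columns",0]]
[["results",0,"data",0,"row",0,"key1"],"row1"]
[["results",0,"data",0,"row",0,"key1"]]
[["results",0,"data",0,"row",0]]
[["results",0,"data",0,"row"]]
[["results",0,"data",0]]
[["results",0,"data"]]
[["results",0]]
[["results"]]

In the leaf events, "results" sits at path index 0, "data" at index 2 and "row" at index 4, which is exactly what the select tests; del(.[0][0:5]) then drops those first five path entries.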

For example:

echo '
  {
    "results":[
      {
        "columns": ["n"],
        "data": [    
          {"row": [{"key1": "row1", "key2": "row1"}], "meta": [{"key": "value"}]},
          {"row": [{"key1": "row2", "key2": "row2"}], "meta": [{"key": "value"}]}
        ]
      }
    ],
    "errors": []
  }
' | jq -cn --stream '
  fromstream(1|truncate_stream(
    inputs | select(
      .[0][0] == "results" and 
      .[0][2] == "data" and 
      .[0][4] == "row"
    ) | del(.[0][0:5])
  ))'

Produces the desired output.

mindeh
James McKinney

(1) The vanilla (non-streaming) filter would be:

jq -r -c '.results[0].data[].row[]'

(2) One way to use the streaming parser here would be to have it extract the elements of .results[0].data, and then finish the job with a second, vanilla jq; the combination of the two steps will probably be slower than the vanilla approach, but it avoids holding the whole document in memory.
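
A minimal sketch of that two-step pipeline, assuming the document shape shown in the question: the first jq streams out each element of .results[0].data as soon as it is complete, and a second, vanilla jq applies the remaining .row[] to those small values:

curl XYZ |
  jq -cn --stream '
    fromstream(4|truncate_stream(inputs
      | select(.[0][0] == "results" and .[0][2] == "data")))' |
  jq -c '.row[]'

Here 4|truncate_stream drops the leading "results",0,"data",<index> path entries, so fromstream emits one {"row": ..., "meta": ...} object per data element; events whose paths are too short (the closes of the enclosing array and objects) are discarded by truncate_stream automatically.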

(3) To produce the output you want, you could run:

jq -nc --stream '
  fromstream(1|truncate_stream(inputs
    | select( [.[0][0,2,4]] == ["results", "data", "row"])
    | del(.[0][0:5]) ))'

(4) Alternatively, you may wish to try something along these lines:

jq -nc --stream 'inputs
      | select(length==2)
      | select( [.[0][0,2,4]] == ["results", "data", "row"])
      | [ .[0][6], .[1]] '

For the illustrative input, the output from the last invocation would be:

["key1","row1"] ["key2","row1"] ["key1","row2"] ["key2","row2"]

peak
  • It's not only about speed. Without streaming, memory usage explodes and curl/jq crash. – Martin Preusse Aug 30 '16 at 18:26
  • Thanks, your update really helped a lot. I'm not really getting there because all rows are combined in the output. And I haven't figured out how to collect all key/value pairs for the individual rows. I updated the data description in the question. – Martin Preusse Aug 30 '16 at 23:12
  • All I have to do is combine all arrays where `.[0][3]` is equal into one object. – Martin Preusse Aug 30 '16 at 23:20

Thanks to the "JSON Machine" library, there's a simple and relatively fast solution to the original problem that avoids the disadvantages (*) of jq's streaming parser (jq --stream), though it does entail installing more software.

To make it trivial to use, I wrote a script named jm (which can be found here). With this script, one has only to write:

curl ... | jm --pointer /results/0/data

Or, if you want to stream the .data from all objects in the .results array:

curl ... | jm --pointer /results/-/data
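
Each streamed value arrives as one JSON text per line (an assumption about jm's default output, based on the examples above), so the bare row objects from the question can then be recovered with an ordinary jq pass over those small values, which for the sample document prints:

curl ... | jm --pointer /results/0/data | jq -c '.row[]'

{"key1":"row1","key2":"row1"}
{"key1":"row2","key2":"row2"}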

(*) The main disadvantages being slowness and obscurity.

peak