
I need to convert JSON to CSV where the JSON contains arrays of variable length, for example:

JSON objects:

{"labels": ["label1"]}
{"labels": ["label2", "label3"]}
{"labels": ["label1", "label4", "label5"]}

Resulting CSV:

labels,labels,labels
"label1",,
"label2","label3",
"label1","label4","label5"

There are many other properties in the source JSON; this is just an excerpt for the sake of simplicity.

Also, I should mention that the process has to treat the JSON as a stream, because the source JSON can be very large (>1GB).

I wanted to use jq with two passes: the first pass would collect the maximum length of the 'labels' arrays, and the second pass would create the CSV, since the number of resulting columns would be known by then. But jq doesn't have a concept of global variables, so I don't know where I could store the running maximum.

I'd like to be able to do that on Windows via CLI. Thank you in advance.

2 Answers


The question shows a stream of JSON objects, so the following solutions assume that the input file is already a sequence as shown. These solutions can also easily be adapted to cover the case where the input file contains one huge array of objects, as discussed in the Epilog below.

A two-invocation solution

Here's a two-pass solution using two invocations of jq. The presentation assumes a bash-like environment.

# First pass: compute the maximum length of the "labels" arrays
n=$(jq -n 'reduce (inputs|.labels|length) as $i (-1;
  if $i > . then $i else . end)' stream.json)
# Second pass: emit the header row and one null-padded row per object
jq -nr --argjson n $n '
  def fill($n): . + [range(length;$n)|null];
  [range(0;$n)|"labels"],
  (inputs | .labels | fill($n))
  | @csv' stream.json

Assuming the input is as described, this is guaranteed to produce valid CSV. Hopefully you can adapt the above to your shell as necessary; for capturing a program's output into a variable on Windows, this may help: Assign output of a program to a variable using a MS batch file
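
For illustration, with the three sample objects from the question saved as stream.json, the first invocation yields n=3 and the second prints:

"labels","labels","labels"
"label1",,
"label2","label3",
"label1","label4","label5"

(@csv quotes string fields, so the header comes out quoted; apart from that, this matches the CSV in the question.)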

Using input_filename and a single invocation of jq

Unfortunately, jq does not have a "rewind" facility, but there is an alternative: read the file twice within a single invocation of jq. This is more cumbersome than the two-invocation solution above but avoids any difficulties associated with the latter.

cat sample.json | jq -nr '

  def fill($n): . + [range(length;$n)|null];
  def max($x): if . < $x then $x else . end;

  foreach (inputs|.labels) as $in ( {n: 0};
    if input_filename == null            # first pass: still reading the stdin copy
    then .n |= max($in|length)
    else .printed += 1                   # second pass: reading sample.json
    end;
    if .printed == null then empty       # first pass emits nothing
    else .n as $n
    | (if .printed == 1 then [range(0;$n)|"labels"] else empty end),
      ($in | fill($n))
    end)
  | @csv' - sample.json
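
Two details make this work: jq consumes its inputs in the order given on the command line, so the stdin copy (-) is fully consumed before sample.json is opened; and input_filename evaluates to null while jq is reading standard input, which is what distinguishes the first pass from the second.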

Another single-invocation solution

The following solution uses a special value (here null) to delineate the two streams:

(cat stream.json; echo null; cat stream.json) | jq -nr '
  def fill($n): . + [range(length; $n) | null];
  def max($x): if . < $x then $x else . end;

  # First pass: fold the maximum .labels length over the stream; the
  # sentinel null makes the foreach emit the result and break out.
  (label $loop | foreach inputs as $in (0;
     if $in == null then . else max($in|.labels|length) end;
     if $in == null then ., break $loop else empty end)) as $n
  # Second pass: the remaining inputs are the second copy of the stream.
  | [range(0;$n)|"labels"],
    (inputs | .labels | fill($n))
  | @csv'
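
As a minimal illustration of the label/break idiom used above (a toy numeric stream, purely for demonstration):

jq -n '(label $out
        | foreach (1,2,3,null) as $x (0;
            if $x == null then . else . + $x end;
            if $x == null then ., break $out else empty end))'

This prints 6: the accumulated value is emitted exactly once, when the null sentinel arrives, and break $out stops the foreach there.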

Epilog

A file with a top-level JSON array that is too large to fit into memory can be converted into a stream of the array's items by invoking jq with the --stream option, e.g. as follows:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
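
For instance, wrapping the sample objects in a single top-level array and piping it through this command reproduces the original stream of objects (bash-like shell assumed):

echo '[{"labels":["label1"]},{"labels":["label2","label3"]}]' \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
{"labels":["label1"]}
{"labels":["label2","label3"]}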
peak
  • Oh right, we could read in the same file twice. ...actually, maybe not, the file gets locked. – Jeff Mercado Jun 12 '19 at 23:05
  • @peak This is fantastic, thank you a lot for the help! It took some time to go through each of the examples and understand the algorithm. I might have misled you with my example, but, in my case, the number of labels doesn't correlate with the current line number. In the first example, you calculate the max length of the labels array, and it should work. But in the second and third examples, it seems to me like you consider the maximum number of labels to be equal to the total number of lines, which should not work with my data. Please correct me if I'm wrong. – Andrey Kiyanovsky Jun 15 '19 at 14:51
  • Fixed. (I hope). Tx. – peak Jun 15 '19 at 19:57

For such a large file, you will probably want to do this in two separate invocations: one to get the column count, then another to actually output the CSV. You could do it in a single invocation if you read the whole file into memory, but we definitely don't want to do that; we want to stream it in where possible.

Things get a little ugly when it comes to storing the result of a command in a variable; writing to a file might be simpler. But I'd rather not use temp files if we don't have to.

REM assuming in a batch file
for /f "usebackq delims=" %%i in (`jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json`) do set cols=%%i
jq -rn --stream --argjson cols "%cols%" "[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json

> jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json

For the first invocation, to get the count of columns, we're taking advantage of the fact that the paths to the array values indicate the lengths of the arrays: each data path ends in the element's index, so the maximum index plus one, taken across all items, is the number of columns we need.
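
To see why, here are the --stream events for the third sample object, {"labels": ["label1", "label4", "label5"]}; the largest index appearing in the paths is 2, so .[0][1] + 1 peaks at 3:

[["labels",0],"label1"]
[["labels",1],"label4"]
[["labels",2],"label5"]
[["labels",2]]
[["labels"]]

(For the final closing event the path has length 1, so .[0][1] is null and null + 1 is 1, which never affects the maximum.)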


> jq -rn --stream --argjson cols "%cols%" ^
"[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json

Then, to output the rest, we take each labels array (assuming it's the only property on the objects) and pad it with nulls up to the $cols count, then output as CSV.
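
As a quick check of the padding expression in isolation (a hypothetical standalone run, with cols hardcoded to 3 to match the sample data):

> jq -cn --argjson cols 3 "[\"label2\",\"label3\"] | [.[],(range($cols-length)|null)]"
["label2","label3",null]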


If the labels are in a different, deeply nested path than what's in your example here, you'll need to select based on the appropriate paths.

set labelspath=foo.bar.labels
jq -rn --stream --argjson cols "%cols%" --arg labelspath "%labelspath%" ^
"($labelspath|split(\".\")|[.,length]) as [$path,$depth] | [range($cols)|\"labels\"],(fromstream($depth|truncate_stream(inputs|select(.[0][:$depth] == $path)))|[.[],(range($cols-length)|null)])|@csv" input.json
Jeff Mercado
  • Thank you very much, @jeff-mercado! It took a while to figure out how streams work in jq, but it was worthwhile. In my specific use case, I don't use streams as my input data has a considerable number of small JSON objects rather than one big JSON. But anyway, it was beneficial! – Andrey Kiyanovsky Jun 15 '19 at 17:09