
I have a JSON file containing an array of objects; every object contains a unique value in:

"id":"value"

I've followed this other answer, and I can split the whole document into multiple files using jq and awk:

jq -c ".[]" big.json | gawk '{print > "doc00" NR ".json";}'

This way the output files are named sequentially.
How can I name the files using the id value instead?

George Livanoss

3 Answers


For each element in the array, print the id and the element itself on two separate lines; you can then grab the id from the odd-numbered lines and print the even-numbered lines to files named after the id.

jq -cr '.[] | .id, .' big.json | awk 'NR%2{f=$0".json";next} {print >f;close(f)}'
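To make the intermediate format concrete, here is a small hypothetical two-element big.json run through the jq half of the pipeline:

```shell
# Hypothetical sample input; any array of objects with unique "id" values works.
printf '[{"id":"a","v":1},{"id":"b","v":2}]' > big.json

# Odd lines carry the raw id (thanks to -r), even lines the compact object:
jq -cr '.[] | .id, .' big.json
# a
# {"id":"a","v":1}
# b
# {"id":"b","v":2}
```

The awk half then treats each odd line as a file name and writes the line that follows it into that file.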
oguz ismail
    The OP did not use the -r command-line option, and using it comes with a potential risk (having to do with strings with embedded newline characters) that should be mentioned. If it’s known that the JSON entities being printed are not JSON strings, then using -r is pointless, so on the whole it would be best to omit it unless it can be established that it is really necessary. – peak May 16 '19 at 12:54
  • @peak Yes you have a point, but if we drop -r we should remove double quotes within the awk script, which doesn't sound like a good idea to me:/ – oguz ismail May 16 '19 at 13:06
  • And if the array element has a key, it's guaranteed to be an object, and -r will not change it, right? – oguz ismail May 16 '19 at 13:08
  • Yes, but in the particular case described by the OP, the risk is that the .id might contain a newline. – peak May 16 '19 at 13:51
  • oh yes, that would be a bummer – oguz ismail May 16 '19 at 14:18

Using .id as part of a filename is fraught with risk.

First, there is the potential problem of embedded newline characters.

Second, there is the problem of "reserved" characters, notably "/".

Third, Windows has numerous restrictions on file names -- see e.g. https://gist.github.com/doctaphred/d01d05291546186941e1b7ddc02034d3.

Also, if jq's -r option is used, as suggested in another posting on this page, then .id values of "1" and 1 will both be mapped to 1, which will result in loss of data if ">" is used in awk.

So here is a solution that illustrates how safety can be achieved in an OS X or *ix environment and that goes a long way towards a safe solution for Windows:

jq -c '.[]
       | (.id | if type == "number" then .
                else tostring | gsub("[^A-Za-z0-9-_]";"+") end), .' |
awk '
  function fn(s) { sub(/^\"/,"",s); sub(/\"$/,"",s); return s ".json"; }
  NR%2{f=fn($0); next} 
  {print >> f; close(f);}
' 

Notice especially the use of ">>" to avoid losing data in the case of file name collisions.
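To illustrate the sanitization, here is a hypothetical input whose ids include a "/" (one of the reserved characters mentioned above) and a number; the pipeline maps the offending character to "+":

```shell
# Hypothetical input: one id containing "/" (illegal in a file name), one numeric id.
printf '[{"id":"a/b","v":1},{"id":7,"v":2}]' |
jq -c '.[]
       | (.id | if type == "number" then .
                else tostring | gsub("[^A-Za-z0-9-_]";"+") end), .' |
awk '
  function fn(s) { sub(/^\"/,"",s); sub(/\"$/,"",s); return s ".json"; }
  NR%2{f=fn($0); next}
  {print >> f; close(f);}
'
# The "/" becomes "+", so the output files are a+b.json and 7.json.
```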

peak

Since the problem description indicates the input array is huge, it might be worth considering using jq's streaming parser. In general, this would be appropriate if the input JSON is too large to read into memory, or if reducing computer memory requirements is an important goal.

In brief, instead of invoking jq in the normal way, one adds the -n and --stream command-line options, and replaces the initial .[] by:

fromstream(1|truncate_stream(inputs))

Handling the splitting can then be done as described elsewhere on this page.
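Assuming the same big.json and the id-then-element trick from the first answer, the streaming variant might be assembled as follows (a sketch; the -r caveats discussed in the comments above still apply):

```shell
# -n and --stream make jq parse the input incrementally instead of loading
# the whole array; fromstream/truncate_stream reassemble the top-level elements.
jq -crn --stream '
  fromstream(1|truncate_stream(inputs))
  | .id, .' big.json |
awk 'NR%2{f=$0".json"; next} {print >> f; close(f)}'
```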

peak