
I have a large JSON file with, I'm guessing, 4 million objects. Each top-level object has a few levels nested inside. I want to split that into multiple files of 10,000 top-level objects each (retaining the structure inside each). jq should be able to do that, right? I'm not sure how.

So data like this:

[{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}, {
  "id": 2,
  "user": {
    "name": "Isacco Scrancher",
    "email": "iscrancher1@aol.com",
    "address": {
      "city": "Likwatang Timur",
      "state": "Biharamulo"
    }
  },
  "product": {
    "name": "Beer - Original Organic Lager",
    "code": "47993-200"
  }
}, {
  "id": 3,
  "user": {
    "name": "Elga Sikora",
    "email": "esikora2@statcounter.com",
    "address": {
      "city": "Wenheng",
      "state": "Piedra del Águila"
    }
  },
  "product": {
    "name": "Parsley - Dried",
    "code": "36987-1632"
  }
}, {
  "id": 4,
  "user": {
    "name": "Andria Keatch",
    "email": "akeatch3@salon.com",
    "address": {
      "city": "Arras",
      "state": "Iracemápolis"
    }
  },
  "product": {
    "name": "Wine - Segura Viudas Aria Brut",
    "code": "51079-385"
  }
}, {
  "id": 5,
  "user": {
    "name": "Dara Sprowle",
    "email": "dsprowle4@slate.com",
    "address": {
      "city": "Huatai",
      "state": "Kaduna"
    }
  },
  "product": {
    "name": "Pork - Hock And Feet Attached",
    "code": "0054-8648"
  }
}]

Where this is a single complete object:

{
  "id": 1,
  "user": {
    "name": "Nichols Cockle",
    "email": "ncockle0@tmall.com",
    "address": {
      "city": "Turt",
      "state": "Thị Trấn Yên Phú"
    }
  },
  "product": {
    "name": "Lychee - Canned",
    "code": "36987-1526"
  }
}

And each file would contain a specified number of objects like that.

  • Can you be more specific about the structure of the file? Is the very top level object of the file an array? And is the file small enough to fit in RAM (i.e. if you feed it to the command `jq .` does it crash or work?) – hobbs Apr 13 '18 at 04:25
  • @hobbs Edited to add example. Yes, entire file is basically one giant array of objects. It's not crashing it, but it's struggling on just `.` – Chaz Apr 13 '18 at 14:12

2 Answers


[EDIT: This answer has been revised in accordance with the revision to the question.]

The key to using jq to solve the problem is the -c command-line option, which produces output in JSON-Lines format (i.e., in the present case, one object per line). You can then use a tool such as awk or split to distribute those lines amongst several files.

If the file is not too big, then the simplest approach would be to start the pipeline with:

jq -c '.[]' INPUTFILE
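
For example, to go all the way to files of 10,000 objects each in JSON-Lines format, the pipeline could be completed with split (a sketch, not part of the original answer; the chunk_ prefix is arbitrary, and reading stdin via - assumes GNU or BSD split):

jq -c '.[]' INPUTFILE | split -l 10000 - chunk_

Each resulting file (chunk_aa, chunk_ab, ...) then holds 10,000 one-line objects.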

If the file is too big to fit comfortably in memory, then you could use jq's streaming parser, like so:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
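
To see what this idiom does, here is a tiny illustration: the 1|truncate_stream(...) part strips the outer array so that each top-level element is emitted on its own line, without the whole array ever being held in memory.

echo '[{"a":1},{"b":2}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'

This prints {"a":1} and {"b":2} on separate lines.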

Or you could use a command-line tool such as jstream or jm, which would be faster but which would of course have to be installed.

For further discussion about jq's streaming parser, see e.g. the relevant section in the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser

Partitioning

For different approaches to partitioning the output produced in the first step, see for example How can I split a large text file into smaller files with an equal number of lines?

If it is required that each of the output files be an array of objects, then I'd probably use awk to perform both the partitioning and the re-constitution in one step, but there are many other reasonable approaches.
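
As a sketch of that awk approach (the slice size of 10,000 and the part_%04d.json naming are assumptions, not part of the original answer), each group of lines is wrapped back into a JSON array:

jq -c '.[]' INPUTFILE |
awk -v size=10000 '
  NR % size == 1 { file = sprintf("part_%04d.json", ++n); print "[" > file }  # start a new slice
  NR % size != 1 { print "," >> file }                                        # separator between objects
  { print >> file }                                                           # the object itself
  NR % size == 0 { print "]" >> file; close(file) }                           # close a full slice
  END { if (NR % size != 0) print "]" >> file }                               # close the final partial slice
'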

If the input is a sequence of JSON objects

For reference, if the original file consists of a stream or sequence of JSON objects, then the appropriate invocation would be:

jq -n -c inputs INPUTFILE

Using inputs in this manner allows arbitrarily many objects to be processed efficiently.
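
For example, if a hypothetical OBJECTS.json contains the sample objects concatenated back-to-back rather than wrapped in an array, then:

jq -n -c inputs OBJECTS.json | split -l 10000 - chunk_

yields the same JSON-Lines partitioning as above, reading one object at a time.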

– peak
  • Thanks, but I think I'm missing something - `jq -n -c inputs INPUTFILE` is putting everything on a single line. Is it because the whole file is a giant array? – Chaz Apr 13 '18 at 14:25
  • Ok, I added `[]` and making progress! Now I just need to get the split output back in an array... – Chaz Apr 13 '18 at 14:33
  • Thanks for the great answer @peak! I've used this approach but I ran into some performance issues. Here's my related [SO question](https://stackoverflow.com/q/62825963/3112403) if you have any thoughts. – sal17 Jul 10 '20 at 01:22
  • This is great and got me almost all the way there, except for the part about maybe using `awk` to reassemble. Re-assembling can be done with something like `jq -s '.' < input.json` – Kelstar Aug 29 '23 at 01:38
  • @Kelstar - You cannot use jq to partition STDOUT. That's where a tool such as awk comes in handy. – peak Aug 29 '23 at 06:27
  • @peak - True, you cannot do it in a one-liner, but `awk` seemed awkward. I was looking for a way to do it with `jq` in a for loop. It seemed like the answer could use this piece of information, which I didn't find easily. – Kelstar Aug 31 '23 at 00:45

It is possible to slice a JSON file or stream with jq; see the script below. The sliceSize parameter sets the size of the slices and determines how many inputs are kept in memory at the same time, so memory usage can be controlled.

Input to be sliced

The input does not have to be formatted.

The input can be either:

  • an array of JSON inputs
  • a stream of JSON inputs

Sliced output

The files can be created with formatted or compact JSON.

The sliced output files can contain:

  • an array of JSON inputs with size=$sliceSize
  • a stream of JSON inputs with $sliceSize items

Performance

A quick benchmark shows the time and memory consumption during slicing (measured on my laptop):

file with 100,000 JSON objects, 46 MB

  • sliceSize=5,000: time=35 sec
  • sliceSize=10,000: time=40 sec
  • sliceSize=25,000: time=1 min
  • sliceSize=50,000: time=1 min 52 sec

file with 1,000,000 JSON objects, 450 MB

  • sliceSize=5,000: time=5 min 45 sec
  • sliceSize=10,000: time=6 min 51 sec
  • sliceSize=25,000: time=10 min 5 sec
  • sliceSize=50,000: time=18 min 46 sec, max memory consumption: ~150 MB
  • sliceSize=100,000: time=46 min 25 sec, max memory consumption: ~300 MB
#!/bin/bash

SLICE_SIZE=2

JQ_SLICE_INPUTS='
   2376123525 as $EOF |            # random number that does not occur in the input stream to mark the end of the stream
   foreach (inputs, $EOF) as $input
   (
      # init state
      [[], []];                    # .[0]: array to collect inputs
                                   # .[1]: array that has collected $sliceSize inputs and is ready to be extracted
      # update state
      if .[0] | length == $sliceSize   # enough inputs collected
         or $input == $EOF             # or end of stream reached
      then [[$input], .[0]]        # create new array to collect next inputs. Save array .[0] with $sliceSize inputs for extraction
      else [.[0] + [$input], []]   # collect input, nothing to extract after this state update
      end;

      # extract from state
      if .[1] | length != 0
      then .[1]                    # extract array that has collected $sliceSize inputs
      else empty                   # nothing to extract right now (because still collecting inputs into .[0])
      end
   )
'

write_files() {
  local FILE_NAME_PREFIX=$1
  local FILE_COUNTER=0
  while read -r line; do                   # -r keeps backslash escapes in the JSON intact
    FILE_COUNTER=$((FILE_COUNTER + 1))
    FILE_NAME="${FILE_NAME_PREFIX}_$FILE_COUNTER.json"
    echo "writing $FILE_NAME"
    jq '.'      > "$FILE_NAME" <<< "$line"   # array of formatted json inputs
#   jq -c '.'   > "$FILE_NAME" <<< "$line"   # compact array of json inputs
#   jq '.[]'    > "$FILE_NAME" <<< "$line"   # stream of formatted json inputs
#   jq -c '.[]' > "$FILE_NAME" <<< "$line"   # stream of compact json inputs
  done
}


echo "how to slice a stream of json inputs"
jq -n '{id: (range(5) + 1), a:[1,2]}' |   # create a stream of json inputs
jq -n -c --argjson sliceSize $SLICE_SIZE "$JQ_SLICE_INPUTS" |
write_files "stream_of_json_inputs_sliced"

echo -e "\nhow to slice an array of json inputs"
jq -n '[{id: (range(5) + 1), a:[1,2]}]' |                  # create an array of json inputs
jq -n --stream 'fromstream(1|truncate_stream(inputs))' |   # remove outer array to create stream of json inputs
jq -n -c --argjson sliceSize $SLICE_SIZE "$JQ_SLICE_INPUTS" |
write_files "array_of_json_inputs_sliced"

output of script

how to slice a stream of json inputs
writing stream_of_json_inputs_sliced_1.json
writing stream_of_json_inputs_sliced_2.json
writing stream_of_json_inputs_sliced_3.json

how to slice an array of json inputs
writing array_of_json_inputs_sliced_1.json
writing array_of_json_inputs_sliced_2.json
writing array_of_json_inputs_sliced_3.json

generated files

array_of_json_inputs_sliced_1.json

[
  {
    "id": 1,
    "a": [1,2]
  },
  {
    "id": 2,
    "a": [1,2]
  }
]

array_of_json_inputs_sliced_2.json

[
  {
    "id": 3,
    "a": [1,2]
  },
  {
    "id": 4,
    "a": [1,2]
  }
]

array_of_json_inputs_sliced_3.json

[
  {
    "id": 5,
    "a": [1,2]
  }
]
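
To adapt the script to the question's file (one giant array of roughly 4 million objects), a minimal sketch reusing the script's JQ_SLICE_INPUTS filter and write_files function (BIG.json is a placeholder name):

SLICE_SIZE=10000   # 10,000 objects per output file, per the question

jq -n --stream 'fromstream(1|truncate_stream(inputs))' BIG.json |   # remove outer array
jq -n -c --argjson sliceSize $SLICE_SIZE "$JQ_SLICE_INPUTS" |
write_files "BIG_sliced"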
– jpseng