
I have a very large file (20GB+ compressed) called input.json containing a stream of JSON objects as follows:

{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typea"
}
{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typea"
}
{
    "timestamp": "12345",
    "name": "Some name",
    "type": "typeb"
}

I want to split this file into separate files based on the type property (typea.json, typeb.json, etc.), each containing its own stream of JSON objects with only the matching type.

I've managed to solve this problem for smaller files; however, with such a large file I run out of memory on my AWS instance. As I wish to keep memory usage down, I understand I need to use `--stream`, but I'm struggling to see how to achieve this.

`cat input.json | jq -c --stream 'select(.[0][0]=="type") | .[1]'` returns the value of each type property, but how do I use this to then filter the objects?

Any help would be greatly appreciated!

scrim
  • How many distinct types? (How many file descriptors do we need to keep open if we're fanning out to multiple processes on a single pass? How many passes would we need, if we did one pass per type?) – Charles Duffy Feb 16 '19 at 20:01
    You really *don't* need `--stream` here, btw. That would be needed if your objects were larger, but they're all individually small enough to handle on their own. – Charles Duffy Feb 16 '19 at 20:02
  • BTW, this input isn't (err, wasn't originally) valid JSON at all. `jq .`, fed your example input, fails with `parse error: Invalid literal at line 2, column 14`. – Charles Duffy Feb 16 '19 at 20:07
  • (the only circumstances where I could see `--stream` being needed is if the question misrepresented your data format, and the objects are inside other, larger objects rather than present at top-level). – Charles Duffy Feb 16 '19 at 20:17
  • @CharlesDuffy there's about 5-10 types. Thanks for the help with regards to --stream, seems I misunderstood its use. – scrim Feb 17 '19 at 02:19

2 Answers


Assuming the JSON objects in the file are relatively small (none more than a few MB), you won't need the (rather complex) `--stream` command-line option, which is mainly needed when the input is (or includes) a single humongous JSON entity.

There are, however, several choices still to be made. The main ones are described at Split a JSON file into separate files: a multi-pass approach (N or N+1 calls to jq, where N is the number of output files), and a single-pass approach that makes just one call to jq, followed by a call to a program such as awk to perform the actual partitioning into files. Each approach has its pros and cons, but if reading the input file N times is acceptable, then the first approach might be better.
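
To make the multi-pass idea concrete, here is a minimal sketch (an illustration only, assuming the input really is a top-level stream of small objects as shown in the question, and that the type values contain no whitespace):

#!/usr/bin/env bash
# The "+1" pass: list the distinct values of .type
types=$(jq -r '.type' input.json | sort -u)

# One pass per type: keep only matching objects, one compact object per line
for t in $types; do
  jq -c --arg t "$t" 'select(.type == $t)' input.json > "$t.json"
done

And a minimal sketch of the single-call variant, under the same assumptions: jq prints each object's type on one line and the object itself (re-serialized compactly) on the next, and awk routes every object to the file named after its type:

jq -r '.type, tojson' input.json |
  awk 'NR % 2 == 1 { file = $0 ".json"; next } { print > file }'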

To estimate the total computational resources that will be required, it would probably be a good idea to measure the resources used by running `jq empty input.json`.
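
One way to take that measurement, assuming GNU time is available at /usr/bin/time (as on most Linux systems; this is not the shell builtin):

/usr/bin/time -v jq empty input.json
# "Maximum resident set size" in the verbose report shows jq's peak memory use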

(From your brief writeup, it sounds like the memory issue you've run into results primarily from the unzipping of the file.)
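
If decompression is indeed the culprit, one option is to decompress straight into the pipeline rather than inflating the whole file first, so nothing large ever sits in RAM or on disk. The example below assumes gzip and a hypothetical input.json.gz; substitute the tool matching your actual format:

# Stream the decompressed data into jq instead of unzipping the whole file first
gzip -dc input.json.gz | jq -c 'select(.type == "typea")' > typea.json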

peak
  • Ah. Thanks for the help and links to helpful resources. The unzipping of the file might indeed be causing the memory issue; I'll have a look into it and report back. – scrim Feb 17 '19 at 02:21
  • Yes, the memory issue was being caused by the unzipping process. I've managed to get it working, will now attempt both your approach and Charles Duffy's to figure out which is faster in my use case. Thanks again for the help – scrim Feb 18 '19 at 16:46

Using jq to split the input into a NUL-delimited stream of (type, document) pairs, and native bash (4.1 or later) to write those documents out through a persistent set of file descriptors:

#!/usr/bin/env bash
case $BASH_VERSION in ''|[1-3].*|4.0*) echo "ERROR: Bash 4.1 needed" >&2; exit 1;; esac

declare -A output_fds=( )

while IFS= read -r -d '' type && IFS= read -r -d '' content; do
  if [[ ${output_fds[$type]} ]]; then  # already have a file handle for this output file?
    curr_fd=${output_fds[$type]}       # reuse it, then.
  else
    exec {curr_fd}>"$type.json"        # open a new output file...
    output_fds[$type]=$curr_fd         # and store its file descriptor for use.
  fi
  printf '%s\n' "$content" >&"$curr_fd"
done < <(jq -j '(.type) + "\u0000" + (. | tojson) + "\u0000"')  # jq reads the object stream from this script's stdin

This never reads more than a few records (admittedly, potentially multiple copies of each) into memory at a time, so it'll work with an arbitrarily large file so long as the records are of reasonable size.
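
A possible invocation, assuming the script is saved as split.sh and made executable; because the jq inside the process substitution inherits the script's standard input, the object stream is simply fed on stdin:

# Uncompressed input on stdin:
./split.sh < input.json

# Or stream straight out of the (assumed gzip-compressed) archive:
gzip -dc input.json.gz | ./split.sh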

Charles Duffy
    I'd happily endorse this answer except that my timings show it is about 10 times slower than the jq+awk solution ... – peak Feb 17 '19 at 04:17
  • A single base-10 order-of-magnitude behind awk is exactly what I aim for when doing text processing in bash, given as bash reads character-by-character while awk does buffered reads. Which is to say -- you take that performance difference to call that unacceptable; I take it to mean that I'm writing near-optimal code for the runtime in use. :) – Charles Duffy Feb 17 '19 at 04:36
  • I didn't say unacceptable. By the way, the ratio gets worse for larger input.json. With about 80,000 objects, my measurement is 17:1. – peak Feb 17 '19 at 04:44
  • That's surprising -- I expect it to be at least an order of magnitude slower than awk; I *don't* expect it to get slower with size. How sure are you of that relationship? – Charles Duffy Feb 17 '19 at 04:49
  • It's true I have only looked closely at two input files so far, but the larger of the two files is only ~60MB, which pales into insignificance compared to the OP's input.json. (By the way, the 17:1 ratio is for 800K objects, not 80K as I mistyped.) – peak Feb 17 '19 at 05:48
  • One thing I could see making a difference in performance is the write cache filling and needing to be flushed before further writes can take place. Once you're testing with large enough files for that flush to factor into the average, though, I wouldn't expect it to get slower still -- and the behavior in question isn't really language-specific, except in that a runtime that buffers its writes more could continue generating output (until those buffers fill) instead of stopping and blocking as early. – Charles Duffy Feb 17 '19 at 15:32
  • Doubling the size again to ~1.6 million objects worsens the ratio slightly to 20.89:1 – peak Feb 17 '19 at 19:20
  • This smells to me like the early 10:1 numbers were taken with empty caches, and that we're moving asymptotically towards a steady-state value which would be measured with write caches full. – Charles Duffy Feb 17 '19 at 20:19