
I have around 96 gzipped files of JSON, which together come to over 350 GB of JSON after unzipping, with the following structure:

{
  "structe": {},
  "beta": {},
  "flow": {
    "1023": {
      "0101": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      },
      "0102": {
        "-LEjllNyHqdHYGntO6vu": {
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "status": "1",
          "t": 1528736192996
        }
      }
    },
    "1024": {
      "0103": {
        "-LEjllNyHqdHYGntO6vu": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736191996
        },
        "-LEjllcXKaVOQu3BDpHF": {
          "lat": 51.128676733981,
          "lng": -113.9318991267252,
          "status": "1",
          "t": 1528736192996
        }
      }
    }
  }
}

I can't load this into RAM. I want to stream the file and pull the path flow -> 1023 (call it id1) -> 0101 (call it id2) into a new id1_id2.json file. Any thoughts on how I can do this quickly? The output I'm looking for is a file named 1023_0101.json containing:

{
  "-LEjllNyHqdHYGntO6vu": {
    "status": "1",
    "t": 1528736191996
  },
  "-LEjllcXKaVOQu3BDpHF": {
    "status": "1",
    "t": 1528736192996
  }
}
Fahad Abid
  • What's the expected output for this data? – James Brown Oct 16 '19 at 07:40
  • @JamesBrown I want the 0101 object's data in a new JSON file named 1023_0101.json, with the values of the 0101 object inside the file. Similarly for all nodes at the same level. – Fahad Abid Oct 16 '19 at 07:45
  • 1
    Please, just edit the original post and add the expected output for that sample data. – James Brown Oct 16 '19 at 07:54
  • Can this be of use? https://stackoverflow.com/questions/6886283/how-i-can-i-lazily-read-multiple-json-values-from-a-file-stream-in-python – Pitto Oct 16 '19 at 07:55
  • Please clarify what you mean by "I can't load this in RAM". What exactly is "this"? Are you implying that `gunzip -c $gz | jq empty` fails for each of the zipped files, or only some of them? – peak Oct 16 '19 at 13:40

3 Answers


Here's a solution that uses jq's streaming parser to produce a stream consisting of $id1, $id2, and the corresponding value of interest; this stream can then be piped into another tool (e.g. awk if that's convenient) to produce the desired files.

In the following, we use atomize from the jq cookbook:

  # atomize converts the --stream events of a top-level object into a stream
  # of single-key objects, one per top-level key of the input.
  def atomize(s):
    fromstream(foreach s as $in ( {previous:null, emit: null};
      if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null
      then {emit: [[.previous]], previous: ($in|.[0][0])}
      else { previous: ($in|.[0][0]), emit: null}
      end;
      (.emit // empty), $in) ) ;

The main jq program (invoked with --stream -n -c) is then simply:

atomize(inputs)
| select(type == "object" and .flow)
| .flow
| keys_unsorted[] as $id1
| (.[$id1] | keys_unsorted[]) as $id2
| $id1, $id2, .[$id1][$id2]

So for each gzip file, $gz, the pipeline would look like this:

gunzip -c "$gz" | jq -nc --stream -f program.jq | awk ...

For an example of using awk to produce the desired result, see jq, split a huge json of array and save into file named with a value
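
As a minimal sketch of that awk stage (assuming program.jq holds the filter above, so that each record arrives as three -c lines: the quoted $id1, the quoted $id2, then the compact object):

gunzip -c "$gz" | jq -nc --stream -f program.jq | awk '
  NR % 3 == 1 { gsub(/"/, ""); id1 = $0; next }  # line 1 of each triple: id1 (strip the JSON quotes)
  NR % 3 == 2 { gsub(/"/, ""); id2 = $0; next }  # line 2: id2
  {                                              # line 3: the object itself
    f = id1 "_" id2 ".json"
    print >> f
    close(f)                                     # keep the number of open file descriptors bounded
  }'

Appending with >> and closing each file immediately keeps resource usage flat; if the same id1/id2 pair can recur (see the Caution below), the append would have to be handled differently.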

Caveat and Addendum

jq's streaming parser avoids using RAM at the cost of speed, so usually using the --stream option is only done as a last resort. From the description of the problem, it looks like you might be able to process some of the zipped files using jq's regular parser, so you might want to process those files speedily, leaving the "atomize" approach for those files that are too big.
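
For the files that do fit in memory, a sketch of the non-streaming equivalent, which emits the same $id1/$id2/value triples so that the same downstream stage can be reused:

gunzip -c "$gz" | jq -c '
  .flow
  | keys_unsorted[] as $id1
  | (.[$id1] | keys_unsorted[]) as $id2
  | $id1, $id2, .[$id1][$id2]
' | awk ...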

Caution

The problem description does not make it clear what should be done if there is an id1_id2.json collision. If there is no possibility of such a collision, then of course there's no problem. Otherwise, it would be up to the program that creates those files to manage that contingency.

peak

You can use jq with the --stream option (see jq - I/O (Streaming)), which reads texts in a streaming fashion, allowing programs to start processing large JSON texts immediately rather than after the parse completes (which would require storing the entire file in RAM).

Assuming your id strings are stored in shell variables:

id1=1023; id2=0101

Pipe the decompressed output of your huge gzip file to the following filter:

jq -n --arg v1 "$id1" --arg v2 "$id2" --stream 'fromstream(inputs) | objects | .flow[$v1][$v2]' > "${id1}_${id2}.json"

Or, if the id names can't be fetched in advance and you need to discover them on the fly, use their names directly:

jq -n --stream 'fromstream(inputs) | objects | .flow."1023"."0101"'
Inian
  • That'll eat up the same amount (maybe even more) of RAM as `jq '.'` will do – oguz ismail Oct 16 '19 at 08:29
  • @oguzismail: I really don't have a huge JSON to try this out, but the --stream option should allow parsing the JSON as it sees right? How do you claim it will eat up more – Inian Oct 16 '19 at 08:36
  • 1
    `fromstream(inputs)` builds up the original file in internal memory, you can see `jq '.'` and `jq --stream 'fromstream(inputs)'` will emit the same output – oguz ismail Oct 16 '19 at 08:39
  • 1
    @oguzismail: Yes it might appear the same, but the very definition of `--stream` option says, the text is parsed immediately, so I'm guessing `jq '.'` takes the whole file into memory but `jq --stream` doesn't – Inian Oct 16 '19 at 08:42
  • That's not the problem, the problem is, fromstream will reassemble the whole file internally. What's the difference between it and reading the whole file into memory? – oguz ismail Oct 16 '19 at 08:44
  • So if I understand right, to have a real case of streaming, I need to use `jq --stream 'inputs | ..'` rather than `jq --stream 'fromstream(inputs) | ..'`. I'm not sure how `fromstream()` is implemented though – Inian Oct 16 '19 at 08:46
  • yes. you can see it here https://github.com/stedolan/jq/blob/master/src/builtin.jq – oguz ismail Oct 16 '19 at 08:48
  • Also, can you look at this https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser from which I improvised my answer which says `fromstream(inputs)` can be used or maybe I understood it wrong – Inian Oct 16 '19 at 08:52
  • It still filters `inputs`, and doesn't look like good example to me. I think we should ask @peak, he's one of the maintainers right? – oguz ismail Oct 16 '19 at 08:54
  • 1
    @oguzismail: Yes I would wait for him though, also another point, that supports my inference - https://github.com/stedolan/jq/wiki/Cookbook#processing-huge-json-texts – Inian Oct 16 '19 at 08:55
  • @Inian The idea seems right, but the thing is I don't know id1 and id2 in advance; these ids only appear while traversing the file. So a script that works without passing them as arguments would be great – Fahad Abid Oct 16 '19 at 09:00
  • @FahadAbid: In that case, do `jq --stream 'fromstream(inputs)| objects | .flow."1023"."0101"'` – Inian Oct 16 '19 at 09:02
  • @Inian - When `inputs` is used with `--stream`, you get all the benefits of jq's streaming parser. Try something like: `jq -nc --stream 'inputs|debug'` to see. – peak Oct 16 '19 at 13:30
  • @peak : So my attempt is a proper use of the streaming feature in jq? Just wanted to be sure – Inian Oct 16 '19 at 14:28
  • @OguzIsmail's point about using fromstream(inputs) with --stream is correct. – peak May 08 '20 at 21:37
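
In light of that thread, a genuinely streaming variant of this answer (a sketch, assuming $id1 and $id2 are known in advance) filters the raw events before reassembly, so that only the selected subtree is ever held in memory:

gunzip -c "$gz" |
jq -nc --stream --arg v1 "$id1" --arg v2 "$id2" '
  fromstream(3 | truncate_stream(
    inputs
    | select(.[0][0] == "flow" and .[0][1] == $v1 and .[0][2] == $v2)  # keep only events under flow/$v1/$v2
  ))' > "${id1}_${id2}.json"

Here 3 | truncate_stream strips the leading flow/$v1/$v2 path components so that fromstream reassembles just the inner object.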

What first comes to mind is treating the file as a stream and reading it piece by piece. There are already libraries that treat JSON files as streams; for example, you can check out this snippet from the ijson library:

For JSON like:

{
  "earth": {
    "europe": [
      {"name": "Paris", "type": "city", "info": { ... }},
      {"name": "Thames", "type": "river", "info": { ... }},
      // ...
    ],
    "america": [
      {"name": "Texas", "type": "state", "info": { ... }},
      // ...
    ]
  }
}

The processing would then look like:

import sys
from urllib.request import urlopen

import ijson

stream = sys.stdout  # any writable file-like object for the XML output
parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')
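
Adapted to the structure in the question, a minimal sketch (assuming a single gzipped input, and that the ids never contain dots, since ijson uses "." as its prefix separator) could look like:

import gzip
import json

import ijson
from ijson.common import ObjectBuilder

def split_flow(path):
    with gzip.open(path, 'rb') as f:
        builder = None
        for prefix, event, value in ijson.parse(f):
            parts = prefix.split('.')
            if builder is None:
                # entering an object at depth flow.<id1>.<id2>: start collecting
                if event == 'start_map' and len(parts) == 3 and parts[0] == 'flow':
                    id1, id2 = parts[1], parts[2]
                    builder = ObjectBuilder()
                    builder.event(event, value)
            else:
                builder.event(event, value)
                # leaving that object: dump it to <id1>_<id2>.json
                if event == 'end_map' and len(parts) == 3 and parts[0] == 'flow':
                    with open('%s_%s.json' % (id1, id2), 'w') as out:
                        # ijson yields Decimal for non-integer numbers; default=float makes them serializable
                        json.dump(builder.value, out, default=float)
                    builder = None

split_flow('flow.json.gz')  # hypothetical input file name

Only one flow.<id1>.<id2> subtree is held in memory at a time, so peak RAM is bounded by the largest single subtree rather than by the whole file.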
user5214530