
I have a fairly large JSON file (~8GB) that has the following format:

{
    // some fixed fields
    "data": [
        {
            // unimportant fields
            "importantField": "some alphanumeric string"
        },
        {
            // same format as previous
        },
        ...
    ]
}

I want to extract all the values of importantField into a separate file, and I want to do it automatically.

I tried using this grep command:

grep -E -o '"importantField":"[[:alnum:]]+"' file.json

but the process was terminated due to excessive memory usage (it used more than 80% of my RAM at some points, and the GUI became unresponsive).

Then I tried first using the split command to break the input into 2GB files:

split --bytes 2000000000 file.json out_

and then ran the same grep command from above on each chunk; this time every chunk finished fairly quickly, in about 30 seconds.

This method, where I have to split the input first, would be fine for me; the only problem is automatically checking that the split command splits the file properly, i.e. not in the middle of an importantField key/value pair, since that would result in losing some of the important data.
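
One idea I haven't verified yet: a sufficiently recent GNU split (coreutils 8.25+ for --separator) can cut on a record separator instead of at arbitrary byte offsets, which should make the chunking safe by construction:

# untested sketch -- every record ends with '}', so with '}' as the
# record separator the chunks can only break between records, never inside one
split --line-bytes=2G --separator='}' file.json out_
grep -E -o -h '"importantField":"[[:alnum:]]+"' out_* > fields.txt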

I'd like to know if there's a better/more elegant way to do this.

What also interests me is why grep fails on the 8GB file but works like a charm on the 2GB chunks. The regex I use for matching doesn't seem to be evil.

My assumption is that grep tries to load the whole line first (which alone uses half of my RAM), and then needs more memory for its internal calculations; that pushes the system into swap, which in turn causes the really slow performance (10+ minutes) before the program is finally terminated.
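
If that assumption holds, injecting newlines into the stream before grep ever sees it should keep the line buffer small, with no splitting needed at all. A rough sketch, assuming the input really is one minified line and the values are purely alphanumeric:

# every object starts with '{' and the values contain no '{', so replacing
# each '{' with a newline can never cut an importantField pair in half
tr '{' '\n' < file.json | grep -E -o '"importantField":"[[:alnum:]]+"' > fields.txt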

Some important info specific for this problem:

  • The format of the objects inside the data array will always be the same
  • The input JSON is minified; it contains no spaces or newlines
  • The input file is static
  • I'm obviously interested in extracting all of the important data
  • Tools like `grep` and `split` are just the wrong ones for this case. Consider using `jq`. – terrorrussia-keeps-killing Dec 22 '21 at 13:14
  • I don't think `jq` is a good tool in this case. I don't need to parse the whole JSON; I just care about the data that matches the pattern `"importantField": "[alphaNum]+"`. Care to elaborate on why you think it's good? – honknoodle Dec 22 '21 at 13:19
  • Did you grep the file directly (i.e., `grep file`)? Try piping it instead, like `cat file | grep ` – pepoluan Dec 22 '21 at 13:22
  • 1
    Let me ask you opposite: why not using problem-oriented tools? Would it be considered elaborating enough? `jq` is designed to process JSON data in (shell) scripting in streaming fashion. It at least allows to specify the JSON path nodes of your interest. The only thing I'm not sure is that how to put a filter on the filtered values using `jq` (I merely have never had a reason of doing so). `grep` does not care the context grammar (BTW do you have the JSON minified so that grep runs out of memory?), `split` neither possibly breaking JSON values in the middle (wasting files anyway). – terrorrussia-keeps-killing Dec 22 '21 at 13:32
  • What you might want to use could look like this: `jq -r '.data[].importantField | scan("^[0-9a-zA-Z]+$")' < BIG-FILE.json`. I just found that `jq` supports value filtering, and you could probably build a more sophisticated filter to match your needs. – terrorrussia-keeps-killing Dec 22 '21 at 13:34
  • Using non-context-aware tools to parse JSON, HTML, XML, ... is very often a nightmare. For instance, you can extract too much, too little, or several fields with the same value; performance may be worse than with tools designed for the job; etc. [There's even a famous question about this problem](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Aserre Dec 22 '21 at 13:36
  • I agree that using a streaming JSON parser would likely be the right way to go here; that said, if you really do want to work on plain text, bash's own `read` implementation has `-n` and `-N` options that take care of the memory problem, and if for some reason that doesn't fit your needs, the `dd` tool can also stream the data to you in chunks. – Raxi Dec 22 '21 at 13:37
  • Alright, thanks for your suggestions, guys. I already tried `jq` and it was really slow, so I thought that all the JSON parsing it does is probably unnecessary and I can get away with just looking for the regex match. I will try your suggestions and let you know what worked best. – honknoodle Dec 22 '21 at 13:47
  • @honknoodle How long does it take to extract the necessary data from your big file on your machine? As others said, regexes can never work reliably here, producing false positives or false negatives, because they don't use the grammar context. They are _regular expressions_, not JSON parsers. If you can program (sorry, it's not clear from your profile), you could implement a tool built around an optimized JSON parser. I'd still like to know why `jq` is so slow in your case though (because of `scan`?). – terrorrussia-keeps-killing Dec 22 '21 at 14:07
  • 1
    Okay, I see now that `jq` streams are surprisingly (very?) slow as mentioned here: https://stackoverflow.com/questions/62825963/improving-performance-when-using-jq-to-process-large-files . The guys at that question are suggesting to use other ready-to-use tools. I'm really surprised that `jq` sucks at speed. Well... Nothing to say from me. Perhaps you'll gain some more luck and performance improvements by using one of those tools in the linked question or implementing a custom tool serving your only purpose. Blind trust, just like I do, is sometimes very bad. :P – terrorrussia-keeps-killing Dec 22 '21 at 14:14
  • @fluffy Slow speed is the tradeoff for not having to store the entire input in memory at once. – chepner Dec 22 '21 at 14:25
  • 2
    @chepner I did believe that `jq` can build an efficient filter to filter against the given JSON input. I've just generated a dummy 6.5 Gb JSON file and the above `jq` command is still (!!) running not even producing a single line while I have implemented an alternate JSON stream extractor in Java w/ the Gson library, the latter takes about 35 seconds on my machine to extract the `importantField` property values matching the regex (the Gson parser is not very efficient even in the Java world). – terrorrussia-keeps-killing Dec 22 '21 at 14:47
  • 2
    I just killed the `jq` instance running more than 15 minutes not even producing a single line. My not-C-C++-Rust-but-slow-Java-slow-Gson implementation took less than a minute to consume the whole JSON file and produce filtered output to /dev/null. Have no idea how `jq` implements streaming, but yeah, now I see that I must admit my suggestion for using `jq` was really bad, and I'll consider finding a faster tool. – terrorrussia-keeps-killing Dec 22 '21 at 14:53
  • @fluffy Yes, I can program in quite a few languages: C, C++, Rust, Java, Python, Node.js... would you suggest implementing it in one of these? Also, I tried to use the `--stream` option with `jq`, but the documentation for it is just way too arcane; I have no idea how to use it (a rough sketch of a `--stream` invocation follows these comments). I think I'm going to go with implementing my own solution in Rust or Java. – honknoodle Dec 22 '21 at 16:43
  • @honknoodle I guess any of these that supports streamed reading (and writing) would be fine. I have a very rough solution in Java using Google Gson, as I mentioned above; you can try tweaking it, or even reimplement it in C/C++/Rust, probably gaining a notable performance boost. You can find the solution at https://pastebin.com/dzPkLQ8A -- it's set to expire and be removed in a week. Good luck! – terrorrussia-keeps-killing Dec 22 '21 at 18:14
  • please update the question to show the expected output (for the given input); do you want to extract the values for ***ALL*** instances of `importantField` or just *some*? can you confirm the input is nicely formatted as displayed in your sample (ie, all entries sit on a separate line)? also consider updating the question to show the various attempts at code (eg, `grep`, `split`) you've tried (along with the results); one last question: what's with the `streaming` comments ... is this a static file or is it being written to while you're trying to parse it? – markp-fuso Dec 22 '21 at 19:17
  • could you define what you mean by `"automatic"`? does this extraction need to occur after some sort of trigger event? does the extraction need to occur at a specific calendar/clock time? – markp-fuso Dec 22 '21 at 19:23
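
For reference, a sketch of the `jq` `--stream` invocation asked about in the comments (untested, and per the thread above it may still be too slow). The documented pattern `fromstream(2|truncate_stream(inputs))` reassembles one element of the `data` array at a time, so the whole document is never held in memory:

# untested sketch; -n is needed because `inputs` itself consumes the stream
jq -rn --stream 'fromstream(2 | truncate_stream(inputs)) | .importantField // empty' file.json > fields.txt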

0 Answers