I have a fairly large JSON file (~8GB) that has the following format:
{
    // some fixed fields
    "data": [
        {
            // unimportant fields
            "importantField": "some alphanumeric string"
        },
        {
            // same format as previous
        },
        ...
    ]
}
I want to extract all the values of importantField into a separate file, and I want to do it automatically.
I tried using this grep command:
grep -E -o '"importantField":"[[:alnum:]]+"' file.json
but the process was terminated because of excessive memory usage (it used more than 80% of my RAM at some points and the GUI became unresponsive).
Then I tried to first use the split command to separate the input into 2GB files:
split --bytes 2000000000 file.json out_
and then run the same grep command from above on each chunk; this time it finished fairly quickly, in about 30 seconds per chunk.
This method where I have to split the input first would be fine for me, but the only problem is automatically checking that split doesn't cut the file in the middle of an importantField key-value pair, since that would result in losing some of the important data.
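One workaround I'm considering (just a sketch, and it assumes the important values never contain a comma, which they can't since they're alphanumeric) is to break the single minified line into many short lines before grep ever sees it, so nothing has to hold a multi-gigabyte line in memory and no importantField pair can be cut in half:

# Turn every comma into a newline so grep works line by line; the
# "importantField":"value" pairs contain no commas, so none of them
# gets split across lines.
tr ',' '\n' < file.json | grep -E -o '"importantField":"[[:alnum:]]+"' > important.txt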
I'd like to know if there's a better/more elegant way to do this.
What also interests me is why grep doesn't work on the 8GB file but works like a charm on the 2GB files. The regex I use for matching doesn't seem to be evil. My assumption is that grep tries to load the whole line first (which uses half of my RAM) and then needs even more memory for its internal calculations, which causes the system to start swapping; that in turn causes really slow performance before the program is terminated (10+ minutes).
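To check that assumption (assuming GNU time is available as /usr/bin/time), I could compare the peak memory grep needs on one 2GB chunk with what it needs on the full file:

# "Maximum resident set size" in the verbose report shows the peak
# memory of the grep run; out_aa is the first chunk produced by split.
/usr/bin/time -v grep -E -o '"importantField":"[[:alnum:]]+"' out_aa > /dev/null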
Some important info specific to this problem:
- The format of the objects inside the data array will always be the same
- The input JSON is minified; it contains no spaces or newlines
- The input file is static
- I'm obviously interested in extracting all of the important data
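Given these constraints, one direction that looks promising to me (only a sketch; it assumes jq is installed and that importantField sits directly inside each object of the data array) is jq's --stream mode, which parses the JSON incrementally instead of loading the whole document:

# In --stream mode jq emits [path, value] pairs for each leaf, e.g.
# [["data", 0, "importantField"], "abc123"]; select the leaves whose
# path ends in "importantField" and print only the value.
jq -r --stream 'select(length == 2 and .[0][0] == "data" and .[0][-1] == "importantField") | .[1]' file.json > important.txt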