Use Case
I need to split large files (~5G) of JSON data into smaller files with newline-delimited JSON in a memory efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.
Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can't simply use the split command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:
Example of a source file with newlines.
[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]
Example of a source file without newlines.
[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]
Here's an example of the desired format for a single output file:
{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}
Current Solution
I'm able to achieve the desired result by using jq and split as described in this SO Post. This approach is memory efficient thanks to the jq streaming parser. Here's the command that achieves the desired result:
cat large_source_file.json \
| jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
| split --line-bytes=1m --numeric-suffixes - split_output_file
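To make the streaming step concrete, here is a minimal sketch of what the filter above sees and emits for a tiny array. The function names `stream_events` and `array_to_ndjson_jq` are hypothetical wrappers added for illustration; the jq filter itself is the one from the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting the streaming filter on small inputs.

# Print the raw [path, leaf] events that jq's streaming parser produces.
stream_events() {
  jq -c --stream '.'
}

# Same filter as in the command above: strip the outer array index from
# each event path, then rebuild one object per array element.
array_to_ndjson_jq() {
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
}

# Usage:
#   echo '[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]' | stream_events
#   echo '[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]' | array_to_ndjson_jq
#   # {"id":1,"name":"foo"}
#   # {"id":2,"name":"bar"}
```

With this input, `stream_events` starts with `[[0,"id"],1]` and `[[0,"name"],"foo"]`. Every leaf value becomes its own event, and the filter then reassembles those events into objects, so jq is doing more work per byte than a plain text substitution.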
The Problem
The command above takes ~47 mins to process through the entire source file. This seems quite slow, especially when compared to sed, which can produce the same output much faster.
Here are some performance benchmarks to show processing time with jq vs. sed.
export SOURCE_FILE=medium_source_file.json # smaller 250MB
# using jq
time cat ${SOURCE_FILE} \
| jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
| split --line-bytes=1m - split_output_file
real 2m0.656s
user 1m58.265s
sys 0m6.126s
# using sed
time cat ${SOURCE_FILE} \
| sed -E 's#^\[##g' \
| sed -E 's#^,\{#\{#g' \
| sed -E 's#\]$##g' \
| sed 's#},{#}\n{#g' \
| split --line-bytes=1m - sed_split_output_file
real 0m25.545s
user 0m5.372s
sys 0m9.072s
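As a sanity check on the sed approach, the sketch below wraps the same four expressions in a function and runs them on two small inputs. The function name `array_to_ndjson_sed` is hypothetical, and the sketch assumes GNU sed, which treats `\n` in the replacement as a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the four sed expressions from the benchmark, unchanged.
# Assumes GNU sed (\n in the replacement is interpreted as a newline).
array_to_ndjson_sed() {
  sed -E 's#^\[##g' \
    | sed -E 's#^,\{#\{#g' \
    | sed -E 's#\]$##g' \
    | sed 's#},{#}\n{#g'
}

# Layout with newlines between elements (commas lead each line).
printf '%s\n' '[{"id": 1, "name": "foo"}' ',{"id": 2, "name": "bar"}]' \
  | array_to_ndjson_sed
# {"id": 1, "name": "foo"}
# {"id": 2, "name": "bar"}

# Compact single-line layout with no space after the comma.
printf '%s\n' '[{"id": 1},{"id": 2}]' | array_to_ndjson_sed
# {"id": 1}
# {"id": 2}
```

The last expression only splits on the exact text `},{`, so a separator written as `}, {` (with a space, as in the no-newline example above) is left on one line. A `},{` sequence inside a string value would also be split. That fragility is the trade-off for the speed.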
Questions
- Is this slower processing speed expected for jq compared to sed? It makes sense jq would be slower given it's doing a lot of validation under the hood, but 4X slower doesn't seem right.
- Is there anything I can do to improve the speed at which jq can process this file? I'd prefer to use jq to process files because I'm confident it could seamlessly handle other line output formats, but given I'm processing thousands of files each day, it's hard to justify the speed difference I've observed.