
Use Case

I need to split large files (~5 GB) of JSON data into smaller files containing newline-delimited JSON, in a memory-efficient way (i.e., without having to read the entire JSON blob into memory). The JSON data in each source file is an array of objects.

Unfortunately, the source data is not newline-delimited JSON and in some cases there are no newlines in the files at all. This means I can't simply use the split command to split the large file into smaller chunks by newline. Here are examples of how the source data is stored in each file:

Example of a source file with newlines.

[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
...
,{"id": 9, "name": "qux"}]

Example of a source file without newlines.

[{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}, ...{"id": 9, "name": "qux"}]

Here's an example of the desired format for a single output file:

{"id": 1, "name": "foo"}
{"id": 2, "name": "bar"}
{"id": 3, "name": "baz"}

Current Solution

I'm able to achieve the desired result by using jq and split as described in this SO Post. This approach is memory efficient thanks to the jq streaming parser. Here's the command that achieves the desired result:

cat large_source_file.json \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m --numeric-suffixes - split_output_file

The Problem

The command above takes ~47 minutes to process the entire source file. This seems quite slow, especially when compared to sed, which can produce the same output much faster.

Here are some performance benchmarks to show processing time with jq vs. sed.

export SOURCE_FILE=medium_source_file.json  # smaller 250MB

# using jq
time cat ${SOURCE_FILE} \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))' \
  | split --line-bytes=1m - split_output_file

real    2m0.656s
user    1m58.265s
sys     0m6.126s

# using sed
time cat ${SOURCE_FILE} \
  | sed -E 's#^\[##g' \
  | sed -E 's#^,\{#\{#g' \
  | sed -E 's#\]$##g' \
  | sed 's#},{#}\n{#g' \
  | split --line-bytes=1m - sed_split_output_file

real    0m25.545s
user    0m5.372s
sys     0m9.072s

Questions

  1. Is this slower processing speed expected for jq compared to sed? It makes sense jq would be slower given it's doing a lot of validation under the hood, but 4X slower doesn't seem right.
  2. Is there anything I can do to improve the speed at which jq can process this file? I'd prefer to use jq to process files because I'm confident it could seamlessly handle other line output formats, but given I'm processing thousands of files each day, it's hard to justify the speed difference I've observed.
sal17
    jq's "streaming parser" is known to be slow (and maybe slower than need be), but there may be differences between versions of jq. Within the jq realm, the best one can do is to determine which jq version is least slow. If you have time, it would be worth checking 1.5, 1.6, and the `master` version. Needless to say, if you're processing so many files per day, it might be worth trying to have the files available to you in a sensible format. – peak Jul 10 '20 at 03:39
  • 2
    @dmitry - Could jtc be helpful here? – peak Jul 10 '20 at 03:41
  • 4
    some suggestions for `sed` command: avoid `cat`, serves no purpose and takes up time... use `LC_ALL=C` if input is ASCII, will speed up considerably... if input doesn't have newlines at all, using `GNU awk` would be better as it can use a custom record separator... `awk -v RS='}, ' -v ORS='}\n' 'NR==1{sub(/^\[/, "")} RT; END{sub(/]\n/, "\n"); printf "%s", $0}'` (again, use `LC_ALL=C` if input if ASCII) – Sundeep Jul 10 '20 at 04:37
  • @peak I will check against `1.5`, `1.6`, and `master`, then report back. I completely agree on being provided the files in a sensible format. These come from an API we integrate with, managed by another company. They're aware of the file format issue and hopefully they'll address it soon, but in the meantime we need to work with what we have. – sal17 Jul 10 '20 at 04:38
  • 1
    @peak, I plan to add _streamed parsing_ in `jtc` only in the next version (and it'll be multi-threaded, so I hope it'll be fast), so currently it's not possible to do it a memory efficient way in `jtc` :( – Dmitry Jul 10 '20 at 17:11
  • The perl [Cpanel::JSON::XS](https://metacpan.org/pod/Cpanel::JSON::XS#EXAMPLES) module has an example of using its incremental parser to read a huge array of objects that can be easily adapted to write those objects to files (Or to just write one per line to standard output and let `split` do the rest as in your example code) – Shawn Jul 11 '20 at 08:39

3 Answers


jq's streaming parser (the one invoked with the --stream command-line option) intentionally sacrifices speed for the sake of reduced memory requirements, as illustrated below in the metrics section. A tool which strikes a different balance, and which seems to be closer to what you're looking for, is jstream; its homepage is https://github.com/bcicen/jstream.

Running the following sequence of commands in bash or a bash-like shell:

cd
go get github.com/bcicen/jstream
cd go/src/github.com/bcicen/jstream/cmd/jstream/
go build

will result in an executable, which you can invoke like so:

jstream -d 1 < INPUTFILE > STREAM

Assuming INPUTFILE contains a (possibly ginormous) JSON array, the above will behave like jq's .[], with jq's -c (compact) command-line option. In fact, this is also the case if INPUTFILE contains a stream of JSON arrays, or a stream of JSON non-scalars ...
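For the splitting task in the question, here is a hedged sketch of how jstream could slot into the original pipeline in place of jq (assuming the jstream binary built above is on your PATH; the split options are the ones from the question):

# jstream emits each top-level array element as one line of compact JSON,
# so split can then chunk the stream exactly as in the question's pipeline
jstream -d 1 < large_source_file.json \
  | split --line-bytes=1m --numeric-suffixes - split_output_file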

Illustrative space-time metrics

Summary

For the task at hand (streaming the top-level items of an array):

                  mrss   u+s (secs)
jq --stream:      2 MB   447
jstream    :      8 MB   114
jm         :     13 MB   109
jq         :  5,582 MB    39

In words:

  1. space: jstream is economical with memory, but not as much as jq's streaming parser.

  2. time: jstream runs slightly slower than jq's regular parser but about 4 times faster than jq's streaming parser.

Interestingly, space*time is about the same for jstream and jq's streaming parser (2 MB × 447 s ≈ 0.9 GB·s versus 8 MB × 114 s ≈ 0.9 GB·s).

Characterization of the test file

The test file consists of an array of 10,000,000 simple objects:

[
{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
,{"key_one": 0.13888342355537053, "key_two": 0.4258700286271502, "key_three": 0.8010012924267487}
...
]
$ ls -l input.json
-rw-r--r--  1 xyzzy  staff  980000002 May  2  2019 input.json

$ wc -l input.json
 10000001 input.json
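For reference, here is a hedged reconstruction (not necessarily the generator actually used) of an awk one-liner that produces a file of the same shape: an opening [, 10,000,000 identical objects one per line with a leading comma from the second object onward, and a closing ]:

awk 'BEGIN {
  o = "{\"key_one\": 0.13888342355537053, \"key_two\": 0.4258700286271502, \"key_three\": 0.8010012924267487}"
  printf "[\n%s\n", o                  # opening bracket, then the first object
  for (i = 2; i <= 10000000; i++)
    printf ",%s\n", o                  # remaining objects, comma-prefixed
  printf "]"                           # closing bracket, no trailing newline
}' > input.json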

jq times and mrss

$ /usr/bin/time -l jq empty input.json
       43.91 real        37.36 user         4.74 sys
4981452800  maximum resident set size

$ /usr/bin/time -l jq length input.json
10000000
       48.78 real        41.78 user         4.41 sys
4730941440  maximum resident set size

$ /usr/bin/time -l jq type input.json
"array"
       37.69 real        34.26 user         3.05 sys
5582196736  maximum resident set size

$ /usr/bin/time -l jq 'def count(s): reduce s as $i (0;.+1); count(.[])' input.json
10000000
       39.40 real        35.95 user         3.01 sys
5582176256  maximum resident set size

$ /usr/bin/time -l jq -cn --stream 'fromstream(1|truncate_stream(inputs))' input.json | wc -l
      449.88 real       444.43 user         2.12 sys
   2023424  maximum resident set size
 10000000

jstream times and mrss

$ /usr/bin/time -l jstream -d 1 < input.json > /dev/null
       61.63 real        79.52 user        16.43 sys
   7999488  maximum resident set size

$ /usr/bin/time -l jstream -d 1 < input.json | wc -l
       77.65 real        93.69 user        20.85 sys
   7847936  maximum resident set size
 10000000

jm times and mrss

jm is another command-line utility for "streaming" JSON arrays and objects, using minimal memory.

$ /usr/bin/time -lp jm --count input.json
       real 110.75       user 107.67       sys 1.22
13 295 616  maximum resident set size
 7 303 168  peak memory footprint
 10000000
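For the splitting task itself, a hedged usage sketch, assuming that jm invoked without options prints each top-level array item on its own line (only --count is shown above, so check jm's documentation):

jm input.json | split --line-bytes=1m - split_output_file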

peak

Restrictions

In the general case, JSON needs to be parsed with a tool that understands JSON. You can make an exception and follow these suggestions only if you are sure that:

  • You have an array with flat JSON objects (like in the use case) without nested objects.

  • Curly braces do not appear anywhere inside the objects' string values; that is, you don't have any content like this: {"id": 1, "name": "foo{bar}"}.


Use the shell

If the above conditions are met, you can use the shell to convert the input to JSONL and split it into smaller files, and it will be many times faster than JSON parsing or full text processing. Additionally, it can use almost no memory, especially if you use core-utils, with or without some sed or awk.

Even the simpler approach:

grep -o '{[^}]*}' file.json

will be faster, but will need some memory (less than jq).

The sed commands you have tried are fast, but they need memory, because sed, the stream editor, reads line by line; if the file has no newlines at all, it will load all of it into memory, and sed needs 2-3 times the size of the longest line of the stream. But if you first split the stream into lines using core-utils like tr, cut, etc., then memory usage is extremely low and performance is great.
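A hedged way to see this difference for yourself (the file name is hypothetical; GNU time's -f "%M" reports maximum resident set size in KB, as in the benchmark script further down):

# sed alone must buffer the entire newline-free "line"
command time -f "Maximum Memory: %M KB" sed 's/},{/}\n{/g' no_newlines.json > /dev/null

# inserting newlines first keeps sed's working line down to a single object
< no_newlines.json cut -d '}' -f1- --output-delimiter="}"$'\n' |
    command time -f "Maximum Memory: %M KB" sed 's/^,//' > /dev/null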


Solution

After some testing, I found the following to be the fastest while using almost no memory. Besides that, it doesn't depend on the exact characters between objects, like a comma plus a few spaces, a comma alone, etc. It only matches the objects {...} and prints each of them on a new line.

#!/bin/bash -
# bash rather than plain sh, because of the $'\n' quoting below
LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
    cut -sd '{' -f2 | sed 's/^/{/' > "$2"

To split the JSONL output into smaller files, use split with -l (split by line count) rather than splitting by bytes, to ensure you don't split any object; use something like this:

split -l 1000 -d --additional-suffix='.json' - path/to/file/prefix

or all together

#!/bin/bash -
n=1000
LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
    cut -sd '{' -f2 | sed 's/^/{/' |\
    split -l "$n" -d --additional-suffix='.json' - "$2"

Usage:

bash script.sh input.json path/to/new/files/output

will create files output00.json, output01.json, etc. in the selected path.

Note: If your stream contains multi-byte (non-ASCII) characters, remove LC_ALL=C; it is just a small speed optimization and is not necessary.

Note: I have assumed input with no newlines at all, or with newlines as in your first use case. To generalize and allow newlines anywhere in the file, I add a small modification: in this version tr deletes all newlines up front, with almost no impact on performance:

#!/bin/bash -
n=1000
LC_ALL=C < "$1" tr -d $'\n' |\
    cut -d '}' -f1- --output-delimiter="}"$'\n' |\
    cut -sd '{' -f2 | sed 's/^/{/' > "$2"

Testing

Here are some testing results. They are representative; times were similar across all executions.

Here is the script I used; make_json generates the test input (run with various values of n):

#!/bin/bash

make_json() {
    # build one big single-line JSON array of n identical objects (no newlines at all)
    awk -v n=2000000 'BEGIN{
        x = "{\"id\": 1, \"name\": \"foo\"}"
        printf "["
        for (i=1;i<n;i++) { printf x ", " }
        printf x"]"
    }' > big.json
    return 0
}

tf="Real: %E  System: %S  User: %U  CPU%%: %P  Maximum Memory: %M KB\n"
make_json

# each i.sh file contains one of the candidate commands (echoed below after ==>)
for i in {1..7}; do
    printf "\n==> "
    cat "${i}.sh"
    command time -f "$tf" sh "${i}.sh" big.json "output${i}.json"
done

I used a smaller file for the tests that include jq, because jq quickly gets into swap; I then tested larger files using only the efficient solutions.

==> LC_ALL=C jq -c '.[]' "$1" > "$2"
Real: 0:16.26  System: 1.46  User: 14.74  CPU%: 99%  Maximum Memory: 1004200 KB


==> LC_ALL=C jq length "$1" > /dev/null
Real: 0:09.19  System: 1.30  User: 7.85  CPU%: 99%  Maximum Memory: 1002912 KB


==> LC_ALL=C < "$1" sed 's/^\[//; s/}[^}]*{/}\n{/g; s/]$//' > "$2"
Real: 0:02.21  System: 0.33  User: 1.86  CPU%: 99%  Maximum Memory: 153180 KB


==> LC_ALL=C < "$1" grep -o '{[^}]*}' > "$2"
Real: 0:02.08  System: 0.34  User: 1.71  CPU%: 99%  Maximum Memory: 103064 KB


==> LC_ALL=C < "$1" awk -v RS="}, {" -v ORS="}\n{" '1' |\
    head -n -1 | sed '1 s/^\[//; $ s/]}$//' > "$2"
Real: 0:01.38  System: 0.32  User: 1.52  CPU%: 134%  Maximum Memory: 3468 KB


==> LC_ALL=C < "$1" cut -d "}" -f1- --output-delimiter="}"$'\n' |\
    sed '1 s/\[//; s/^, //; $d;' > "$2"
Real: 0:00.94  System: 0.24  User: 0.99  CPU%: 131%  Maximum Memory: 3488 KB


==> LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
    cut -sd '{' -f2 | sed 's/^/{/' > "$2"
Real: 0:00.63  System: 0.28  User: 0.86  CPU%: 181%  Maximum Memory: 3448 KB

# Larger files testing

==> LC_ALL=C < "$1" grep -o '{[^}]*}' > "$2"
Real: 0:20.99  System: 2.98  User: 17.80  CPU%: 99%  Maximum Memory: 1017304 KB


==> LC_ALL=C < "$1" awk -v RS="}, {" -v ORS="}\n{" '1' |\
    head -n -1 | sed '1 s/^\[//; $ s/]}$//' > "$2"
Real: 0:16.44  System: 2.96  User: 15.88  CPU%: 114%  Maximum Memory: 3496 KB


==> LC_ALL=C < "$1" cut -d "}" -f1- --output-delimiter="}"$'\n' |\
    sed '1 s/\[//; s/^, //; $d;' > "$2"
Real: 0:09.34  System: 1.93  User: 10.27  CPU%: 130%  Maximum Memory: 3416 KB


==> LC_ALL=C < "$1" cut -d '}' -f1- --output-delimiter="}"$'\n' |\
    cut -sd '{' -f2 | sed 's/^/{/' > "$2"
Real: 0:07.22  System: 2.79  User: 8.74  CPU%: 159%  Maximum Memory: 3380 KB

thanasisp

I think the answer by thanasisp is great and covers a lot of the problems. The use of cut is very clever to address the sed memory problems.

the sed commands you have tried are fast, but they need memory, because sed, the stream editor, reads line by line; if the file has no newlines at all, it will load all of it into memory, and sed needs 2-3 times the size of the longest line of the stream

However, as noted in that answer, the sed script only works for very simple JSON objects (no nested objects, and no } anywhere except to denote the end of a first-level object).

A more advanced sed script

This can be improved with a more sophisticated sed script which, in addition to the patterns already posted, handles input like:

[{"id": 1, "name": "foo"}
,{"id": 2, "name": "bar"}
,{"id": 3, "name": "baz"}
,{"id":4, "name": 10}
,{"id":5, "name":"\\\" },{"}
,{"id": {"a":6}, "name": 10}]

We achieve this mainly by taking advantage of the fact that {/} and " come in pairs.
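A hedged illustration of that idea, using the same two substitutions the script applies below: strip everything except quotes and backslashes, then delete escaped quotes and stray backslashes; an odd number of remaining quotes means the chunk ends inside a string, an even number means it ends outside one.

printf '%s\n' ',{"id": 1, "name": "foo'  | sed 's@[^"\\]@@g; s@\(\\"\|\\\)@@g'  # """""  (odd: string still open)
printf '%s\n' ',{"id": 1, "name": "foo"}' | sed 's@[^"\\]@@g; s@\(\\"\|\\\)@@g' # """""" (even: safe to test braces)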

json-newline-json.sed

#!/bin/sed -nf

# Skip empty lines
/^$/d

# From first line to first line starting with [
0 , /\[/ {
# Replace opening [ if exists
# + stripping leading whitespace
    /^[[:space:]]*\[/  s@^[[:space:]]*\[@,@
}

# Line starts with a comma (the separator between objects, or the [ replaced above)
/^\,/ {
# Strip it
    s@^\,@@
}

# Start of loop
: x

# Save to hold
h

# Delete all chars except " and \
s@[^"\\]@@g
# Delete escaped quotes (\") and any remaining backslashes
s@\(\\"\|\\\)@@g

# Even number of (unescaped) quotes: the chunk ends outside any string
/^\(""\)\+$/ {
    # Fetch hold
    g
    # Delete everything between ".."
    s@"[^"]*"@@g
    # Delete all chars except {}
    s@[^{}]@@g
    # Match even {} pairs
    /^\([{}][{}]\)\+$/ {
        # The hold space contains our assembled ,{..} object
        g
        # Strip the leading comma
        s@^\,@@
        # Print
        p
        # Skip to next line
        d
    }
    # The hold space contains our partial ,{.. object
    g
    # Append the next chunk, drop the newline added by N, and restart loop x
    N
    s@\n@@
    t x
}

# Odd number of (unescaped) quotes: the chunk ends inside a string
/^\(""\)*"$/ {
    # The hold space contains our partial ,{.." object
    g
    # Fetch the next line to append
    N
    # Delete the newline added by N (append without newline)
    s@\n@@
    # Restart loop x
    t x
}

Combining it with cut, we do:

< huge.json cut -d '}' -f1- --output-delimiter="}"$'\n' |\
 json-newline-json.sed |\
 split ...
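For example, with the same split options as in the question (hedged; json-newline-json.sed is assumed to be executable and on your PATH):

< huge.json cut -d '}' -f1- --output-delimiter="}"$'\n' |\
 json-newline-json.sed |\
 split --line-bytes=1m --numeric-suffixes - split_output_file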

NB

If you have relatively well-defined (and non-weird) JSON input, solving your problem with sed is powerful, portable and fast. However, it's precarious: neither the input nor the output is validated. Parsing serialized formats is best done with more care. So for most intents and purposes I'd probably stick with jq.

CervEd