
Suppose we have a JSON array of length 5 and we want to split it into multiple arrays of length 2, saving the grouped items into different files, using Linux command-line tools.

I tried using the jq and split tools (I am happy with any approach that can be executed from a bash script):

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | jq -c -M '.[]' | split -l 2 -d -a 3 - meta_
$ tail -n +1 meta_*
==> meta_000 <==
{"key1":"value1"}
{"key2":"value2"}

==> meta_001 <==
{"key3":"value3"}
{"key4":"value4"}

==> meta_002 <==
{"key5":"value5"}

The previous command saves the items into the files correctly, but we still need to convert them into valid JSON array format. I tried the --filter option:

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | jq -c -M '.[]' | split -l 2 -d -a 3 - meta2_ --filter='jq --slurp -c -M'
[{"key1":"value1"},{"key2":"value2"}]
[{"key3":"value3"},{"key4":"value4"}]
[{"key5":"value5"}]
$ tail -n +1 meta2_*
tail: cannot open 'meta2_*' for reading: No such file or directory

However, it displays the output on the screen and the results aren't persisted to files. I tried redirecting the output, but I get an error:

echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | jq -c -M '.[]' | split -l 2 -d -a 3 - meta2_ --filter='jq --slurp -c -M > $FILE'
...
split: with FILE=meta2_000, exit 2 from command: jq --slurp -c -M > $FILE

Any hints or better approaches?

EDIT: I tried with the double quotes @andlrc suggested:

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | jq -c -M '.[]' | split -l 2 -d -a 3 - meta2_ --filter="jq --slurp -c -M > $FILE"
bash: -c: line 0: syntax error near unexpected token `newline'
bash: -c: line 0: `jq --slurp -c -M > '
split: with FILE=meta2_000, exit 1 from command: jq --slurp -c -M >
$ cat meta_000 | jq --slurp -c -M
[{"key1":"value1"},{"key2":"value2"}]
Emer
  • ...you won't accept any answer that doesn't use `split`? (Which is to say: Please avoid putting prejudices about which tools are the best way to answer a question into the question itself). – Charles Duffy Nov 02 '16 at 17:52
  • @CharlesDuffy I can accept an answer without using split, thanks for the advice – Emer Nov 02 '16 at 18:14

5 Answers


It'll be easier to build the arrays in the jq filter, then have split write one file per line. No additional filtering is necessary.

range(0; length; 2) as $i | .[$i:$i+2]

produces:

[{"key1":"value1"},{"key2":"value2"}]
[{"key3":"value3"},{"key4":"value4"}]
[{"key5":"value5"}]

So, putting it all together:

$ jq -cM --argjson sublen '2' 'range(0; length; $sublen) as $i | .[$i:$i+$sublen]' \
    input.json | split -l 1 -da 3 - meta2_
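
To verify the result (assuming input.json holds the example array from the question, and the -da 3 numeric suffixes produced above), each output file ends up containing exactly one of the arrays shown earlier:

$ tail -n +1 meta2_*
==> meta2_000 <==
[{"key1":"value1"},{"key2":"value2"}]

==> meta2_001 <==
[{"key3":"value3"},{"key4":"value4"}]

==> meta2_002 <==
[{"key5":"value5"}]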
Jeff Mercado

I found the solution using the jq and split tools. I was missing the double quotes, the '.' filter in jq, and escaping the $ with a backslash, so that the outer shell doesn't expand $FILE before split can provide it to the filter command.

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' |
  jq -c -M '.[]' |
  split -l 2 -d -a 3 - meta2_ --filter="jq --slurp -c -M '.' >\$FILE"
$ tail -n +1 meta2_*
==> meta2_000 <==
[{"key1":"value1"},{"key2":"value2"}]

==> meta2_001 <==
[{"key3":"value3"},{"key4":"value4"}]

==> meta2_002 <==
[{"key5":"value5"}]
peak
Emer

Suppose we have a JSON array of length 5 and we want to split the array into multiple arrays of length 2 and save the grouped items into different files, using linux command line tools.

Xidel can parse the JSON and do what you want; an XQuery 3.1 FLWOR expression with a (tumbling) window clause is the basic idea:

$ xidel -se '
  for tumbling window $w in 1 to 5
  start $s when $s mod 2 eq 1
  return
  join($w)
'
1 2
3 4
5

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | \
  xidel -se '
  for tumbling window $w in 1 to count($json())
  start $s when $s mod 2 eq 1
  return
  array{$w ! $json(.)}
' --output-json-indent=compact
[{"key1": "value1"}, {"key2": "value2"}]
[{"key3": "value3"}, {"key4": "value4"}]
[{"key5": "value5"}]

To save each array as a JSON file you can use Xidel's integrated EXPath File Module:

$ xidel -se '
  for tumbling window $w in 1 to 5
  start $s at $i when $s mod 2 eq 1
  count $i
  return
  x"output_{$i}.json - {join($w)}"
'
output_1.json - 1 2
output_2.json - 3 4
output_3.json - 5

$ echo '[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]' | \
  xidel -se '
  for tumbling window $w in 1 to count($json())
  start $s at $i when $s mod 2 eq 1
  count $i
  return
  file:write(
    x"output_{$i}.json",
    array{$w ! $json(.)},
    {"method":"json"}
  )
'

$ xidel -s output_1.json output_2.json output_3.json -e '$raw'
$ xidel -s output_1.json output_2.json output_3.json -e '$json' --output-json-indent=compact
[{"key1":"value1"},{"key2":"value2"}]
[{"key3":"value3"},{"key4":"value4"}]
[{"key5":"value5"}]
Reino

jq might be the way to go, as mentioned in the other responses. As I was not familiar with jq, I wrote the bash script (splitjson.sh) below using very common commands (echo, cat, wc, head, tail, sed, expr). The script splits the JSON file into chunks no longer than the specified number of bytes. If a split is not possible within that limit (a JSON item is very long, or the specified maximum number of bytes per chunk is too small), the script stops writing the JSON files and reports an error.

Here is an example with the data in the question as example.json:

[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]

The command to execute the script with a maximum number of bytes per chunk is:

$ ./splitjson.sh example.json 40

The result is then:

$ head example.json.*
==> example.json.0 <==
[{"key1":"value1"},{"key2":"value2"}]
==> example.json.1 <==
[{"key3":"value3"},{"key4":"value4"}]
==> example.json.2 <==
[{"key5":"value5"}]

The script handles cases with spaces, tabs, or newlines between the closing brace '}', the comma ',' and the opening brace '{'.

I used this script successfully on JSON files as big as 82 MB, and I'd expect it to work with bigger files.
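
As an illustration of the whitespace handling, here is the script's first sed pass in isolation (a minimal sketch of just that normalization step, run on a small hand-made input, not the full script):

$ printf '[{"key1":"value1"} ,\n {"key2":"value2"}]\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/}[[:space:]]*,[[:space:]]*{/},{/g'
[{"key1":"value1"},{"key2":"value2"}]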

Here is the script (splitjson.sh):

#!/bin/bash
if [ $# -ne 2 ]
then
    echo "usage: $0 file_to_split.json nb_bytes_max_per_split"
    exit 1
fi
if [[ -r $1 ]]
then
    input=$1
    echo "reading from file '$input'"
else
    echo "cannot read from specified input file '$1'"
    exit 2
fi
if [[ $2 = *[[:digit:]]* ]]; then
    maxbytes=$2
    echo "taking maximum bytes '$maxbytes'"
else
    echo "provided maximum number of bytes '$2' is not numeric"
    exit 3
fi

start=0
over=0
iteration=0
inputsize=`cat $input|wc -c`
tailwindow="$input.tail"
echo "input file size: $inputsize"
tmp="$input.tmp"
cp $input $tmp
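# Collapse any spaces, tabs or newlines between '}', ',' and '{' so every item separator becomes exactly '},{'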
sed -e ':a' -e 'N' -e '$!ba' -e 's/}[[:space:]]*,[[:space:]]*{/},{/g' -i'.back' $tmp
rm "$tmp.back"
inputsize=`cat $tmp|wc -c`
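# If the multi-line sed pass left the file empty, redo the normalization with a plain line-by-line substitution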
if [ $inputsize -eq 0 ]; then
    cp $input $tmp
    sed -e 's/}[[:space:]]*,[[:space:]]*{/},{/g' -i'.back' $tmp
    rm "$tmp.back"
fi
inputsize=`cat $tmp|wc -c`
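# Main loop: carve the normalized file into chunks of at most $maxbytes bytes, cutting each chunk at a '},{' boundary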
while [ $over -eq 0 ]; do
    output="$input.$iteration"
    if [ $iteration -ne 0 ]; then
                echo -n "[{">$output
    else
                echo -n "">$output
    fi
    tailwindowsize=`expr $inputsize - $start`
    cat $tmp|tail -c $tailwindowsize>$tailwindow
    tailwindowresultsize=`cat $tailwindow|wc -c`
    if [ $tailwindowresultsize -le $maxbytes ]; then
        cat $tailwindow>>$output
        over=1
    else
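        # Take at most $maxbytes bytes, cut back to the last '},{' separator and close the chunk with '}]'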
        cat $tailwindow|head -c $maxbytes|sed -E 's/(.*)\},\{(.*)/\1}]/'>>$output
    fi
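    # Advance the read offset past this chunk, adjusting for the bracket characters added or dropped at the boundary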
    jsize=`cat $output|wc -c`
    start=`expr $start + $jsize`
    if [ $iteration -eq 0 ]; then
        start=`expr $start + 1`
    else
        start=`expr $start - 1`
    fi
    endofj=`cat $output|tail -c 3`
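    # Every non-final chunk must end with '}]'; otherwise no '},{' separator was found within the byte limit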
    if [ $over -ne 1 ]; then
        if [ ${endofj:1:2} != "}]" ]; then
            if [ ${endofj:0:2} != "}]" ]; then
                echo -e "ERROR: at least one split pattern wasn't found. Aborting. This could be due to wrongly formatted json or due to a json entry too long compared to the provided maximum bytes. Maybe you should try increasing this parameter?\a"
                exit 4
            fi
        fi
    fi
    jsizefinal=`cat $output|wc -c`
    echo "wrote $jsizefinal bytes of json for iteration $iteration to $output"
    iteration=`expr $iteration + 1`
done
rm $tailwindow
rm $tmp
luvzfootball

Splitting the work into two separate jq invocations allows the second one to use the input builtin to consume one item at a time. Using try in the second invocation lets it gracefully handle the case where fewer than two items of input remain.


s='[{"key1":"value1"},{"key2":"value2"},{"key3":"value3"},{"key4":"value4"},{"key5":"value5"}]'

jq '.[]' <<<"$s" | \
  jq -c -n 'repeat(input as $i1 | try (input as $i2 | [$i1, $i2]) catch [$i1])?' | \
  split -l 2 -d -a 3 - meta_

...emits, in the first file:

[{"key1":"value1"},{"key2":"value2"}]
[{"key3":"value3"},{"key4":"value4"}]

...and, in the second:

[{"key5":"value5"}]
Charles Duffy