
I created more than 500,000 JSON documents with a script that connects to an API. I wanted to import these documents into RethinkDB, but RethinkDB does not seem to be able to import that many individual files at once, so I thought about merging them all into one big JSON file (say bigfile.json). Here is their structure:

file 1.json:

{
  "key_1": "value_1.1",
  "key_2": "value_1.2",
  "key_3": "value_1.3",
    ...
  "key_n": "value_1.n"
}

file 2.json:

{
  "key_1": "value_2.1",
  "key_2": "value_2.2",
  "key_3": "value_2.3",
    ...
  "key_n": "value_2.n"
}
...

file n.json:

{
  "key_1": "value_n.1",
  "key_2": "value_n.2",
  "key_3": "value_n.3",
    ...
  "key_n": "value_n.n"
}

I was wondering which structure would be best for the big JSON file (to be complete, each file has a specific name composed of three variables, the first one being a timestamp (YYYYMMDDHHMMSS)), and which command or script would let me perform the merge (so far I have only written bash scripts...).
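For illustration, RethinkDB's import tool can load a single JSON file containing an array of documents, so one workable structure is an array in which each source file becomes one document and its name is stored as a field. A minimal sketch (the "filename" field name and the sample names are assumptions, purely illustrative):

[
  {
    "filename": "20160315091000_var2_var3",
    "key_1": "value_1.1",
    "key_n": "value_1.n"
  },
  {
    "filename": "20160316123000_var2_var3",
    "key_1": "value_2.1",
    "key_n": "value_2.n"
  }
]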

crocefisso
  • How should the output file look? What has the name of the file got to do with it - why is that important? Does it appear within the files or should it? – Mark Setchell Mar 15 '16 at 09:10
  • The output would be a big JSON file. I thought about something like {"bigfile":[file_1,file_2,...,file_n]}, but I have no clue whether that is the best structure for a big file (the output would be more than 1 GB). The names of the files don't appear within the files, but I thought maybe I should make them appear in the big file, since they partially describe the input files. – crocefisso Mar 15 '16 at 09:17
  • You aren't helping much. `yes > file.json` makes a big JSON file. – Mark Setchell Mar 15 '16 at 09:42
  • Ok, sorry, I edited the post. I thought it was clear, since RethinkDB only deals with JSON, that I was asking about the "best structure to create a big JSON file". – crocefisso Mar 15 '16 at 09:49
  • Did you look at `rethinkdb import`? https://rethinkdb.com/docs/importing/ – dalanmiller Mar 15 '16 at 15:46
  • Sure I did, why are you asking? – crocefisso Mar 16 '16 at 12:48
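For reference, the import command mentioned in the comments can load a single JSON file containing an array of documents in one go. A sketch, assuming a target database test and table docs:

rethinkdb import -f bigfile.json --table test.docs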

3 Answers


You mentioned bash, so I assume you are on a *nix system where you can use echo, cat and sed to achieve what you want.

$ ls   
file1.json  file2.json  merge_files.sh  output
$ cat file1.json 
{
    "key_1": "value_1.1",
    "key_2": "value_1.2",
    "key_3": "value_1.3",
    "key_n": "value_1.n"
}
$ ./merge_files.sh
$ cat output/out.json
{
"file1":
{
  "key_1": "value_1.1",
  "key_2": "value_1.2",
  "key_3": "value_1.3",
  "key_n": "value_1.n"
},
"file2":
{
  "key_1": "value_2.1",
  "key_2": "value_2.2",
  "key_3": "value_2.3",
  "key_n": "value_2.n"
}
}

The script below reads all JSON files in the current folder and concatenates them into one 'big' file, with each filename as a key.

#!/bin/bash

# create the output directory (if it does not exist)
mkdir -p output
# remove results from previous runs (-f so a clean first run does not error)
rm -f output/*.json
# add the opening brace
echo "{" >> output/tmp.json
# use all json files in current folder
for i in *.json
do 
    # first create the key; it is the filename without the extension
    echo "\"${i%.json}\":" >> output/tmp.json
    # dump the file's content
    cat "$i" >> output/tmp.json
    # add a comma afterwards
    echo , >>  output/tmp.json
done
# remove the last comma from the file; otherwise it's not valid json
sed '$ s/.$//' output/tmp.json >> output/out.json
# remove tempfile
rm output/tmp.json
# add the closing brace
echo "}" >> output/out.json
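To sanity-check that the merged file is valid JSON before importing it, you can run it through a validator; one quick option, assuming Python is available:

python -m json.tool output/out.json > /dev/null && echo valid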
aleneum
  • `for i in *.json` would be preferable. Also, you should double-quote `$i` so it looks like this: `cat "$i" >> ...`. – Mark Setchell Mar 15 '16 at 10:22
  • Wow! Thank you so much. Not only did it work perfectly, but I understood everything thanks to your teaching skills. Best answer ever! Just one question: what exactly is ".$" in sed '$ s/.$//'? How can it remove the comma if the comma is not mentioned; is it removing the last string? – crocefisso Mar 15 '16 at 11:13
  • @crocefisso, almost correct. It is a regular expression which removes the last character of the last line of input. Have a look [here](http://stackoverflow.com/a/27327973/1617563) – aleneum Mar 15 '16 at 11:27
  • Nice, Thx aleneum! – crocefisso Mar 15 '16 at 12:38
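As a standalone illustration of that sed expression (`$` addresses the last line of input, and `s/.$//` deletes its final character):

$ printf 'line1,\nline2,\n' | sed '$ s/.$//'
line1,
line2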

This can be done with a single command on Linux. From the directory containing all the json files, create a new directory (say "output"), then run

jsonlint -v -f *.json > output/bigfile.json

See the jsonlint source and the jsonlint manual for Ubuntu.

crocefisso
  • Please add an explanation, and state where `jsonlint` can be obtained, and what platforms it is supported on. – mklement0 Mar 19 '16 at 00:15
  • Hopefully it works for thousands of files, but it doesn't work in my case with 100k to millions of files: `argument list too long: jsonlint` – UberMario Jul 16 '18 at 03:11

If you ever need to read a bunch of JSON files into memory as a single object with the filenames as keys and the contents as the corresponding values, consider using jq:

jq -n '[inputs|{(input_filename):.}]|add' FILE...
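For the two sample files from the first answer, the invocation and output would look like this (jq reads each file as one document, wraps it in an object keyed by its filename, and `add` merges those objects):

$ jq -n '[inputs|{(input_filename):.}]|add' file1.json file2.json
{
  "file1.json": {
    "key_1": "value_1.1",
    "key_2": "value_1.2",
    "key_3": "value_1.3",
    "key_n": "value_1.n"
  },
  "file2.json": {
    "key_1": "value_2.1",
    "key_2": "value_2.2",
    "key_3": "value_2.3",
    "key_n": "value_2.n"
  }
}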
peak