
I created more than 500,000 JSON documents with a script that connects to an API. I wanted to import these documents into RethinkDB, but RethinkDB does not seem to be able to import that many individual files at once, so I thought about merging them all into one big JSON file (say bigfile.json). Here is their structure:

file 1.json:

{
  "key_1": "value_1.1",
  "key_2": "value_1.2",
  "key_3": "value_1.3",
    ...
  "key_n": "value_1.n"
}

file 2.json:

{
  "key_1": "value_2.1",
  "key_2": "value_2.2",
  "key_3": "value_2.3",
    ...
  "key_n": "value_2.n"
}
...

file n.json:

{
  "key_1": "value_n.1",
  "key_2": "value_n.2",
  "key_3": "value_n.3",
    ...
  "key_n": "value_n.n"
}

I was wondering which structure would be best for the big JSON file (to be complete, each file has a specific name composed of three variables, the first one being a timestamp (YYYYMMDDHHMMSS)), and which command or script would let me perform the merge (so far I have only written bash scripts...).
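For illustration, RethinkDB's import tool can load a single JSON file containing an array of documents, so one workable structure is an array in which each source file becomes one document and its name is stored as a field. A minimal sketch (the "filename" field name and the sample names are assumptions, purely illustrative):

[
  {
    "filename": "20160315091000_var2_var3",
    "key_1": "value_1.1",
    "key_n": "value_1.n"
  },
  {
    "filename": "20160316123000_var2_var3",
    "key_1": "value_2.1",
    "key_n": "value_2.n"
  }
]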

crocefisso
  • How should the output file look? What has the name of the file got to do with it - why is that important? Does it appear within the files or should it? – Mark Setchell Mar 15 '16 at 09:10
  • The output would be a big JSON file. I thought about something like {"bigfile":[file_1,file_2,...,file_n]}, but I have no clue whether that is the best structure for a big file (the output would be more than 1 GB). The names of the files don't appear within the files, but I thought maybe I should make them appear in the big file, since they partially describe the input files. – crocefisso Mar 15 '16 at 09:17
  • You aren't helping much. `yes > file.json` makes a big JSON file. – Mark Setchell Mar 15 '16 at 09:42
  • Ok, sorry, I edited the post. I thought it was clear, since RethinkDB only deals with JSON, that I was asking about the "best structure to create a big JSON file". – crocefisso Mar 15 '16 at 09:49
  • Did you look at `rethinkdb import`? https://rethinkdb.com/docs/importing/ – dalanmiller Mar 15 '16 at 15:46
  • Sure I did, why are you asking? – crocefisso Mar 16 '16 at 12:48
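For reference, the import command mentioned in the comments can load a single JSON file containing an array of documents in one go. A sketch, assuming a target database test and table docs:

rethinkdb import -f bigfile.json --table test.docs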

3 Answers


You mentioned bash, so I assume you are on a *nix system where you can use echo, cat and sed to achieve what you want.

$ ls   
file1.json  file2.json  merge_files.sh  output
$ cat file1.json 
{
    "key_1": "value_1.1",
    "key_2": "value_1.2",
    "key_3": "value_1.3",
    "key_n": "value_1.n"
}
$ ./merge_files.sh
$ cat output/out.json
{
"file1":
{
  "key_1": "value_1.1",
  "key_2": "value_1.2",
  "key_3": "value_1.3",
  "key_n": "value_1.n"
},
"file2":
{
  "key_1": "value_2.1",
  "key_2": "value_2.2",
  "key_3": "value_2.3",
  "key_n": "value_2.n"
}
}

The script below reads all JSON files in the current folder and concatenates them into one 'big' file, with each filename as a key.

#!/bin/bash

# create the output directory (if it does not exist)
mkdir -p output
# remove results from previous runs (-f so a clean first run does not error)
rm -f output/*.json
# add the opening brace
echo "{" >> output/tmp.json
# use all json files in current folder
for i in *.json
do 
    # first create the key; it is the filename without the extension
    echo "\"${i%.json}\":" >> output/tmp.json
    # dump the file's content
    cat "$i" >> output/tmp.json
    # add a comma afterwards
    echo , >>  output/tmp.json
done
# remove the last comma from the file; otherwise it's not valid json
sed '$ s/.$//' output/tmp.json >> output/out.json
# remove tempfile
rm output/tmp.json
# add the closing brace
echo "}" >> output/out.json
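To sanity-check that the merged file is valid JSON before importing it, you can run it through a validator; one quick option, assuming Python is available:

python -m json.tool output/out.json > /dev/null && echo valid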
aleneum
  • `for i in *.json` would be preferable. Also, you should double-quote `$i` so it looks like this: `cat "$i" >> ...`. – Mark Setchell Mar 15 '16 at 10:22
  • Wow! Thank you so much. Not only did it work perfectly, but I understood everything thanks to your teaching skills. Best answer ever! Just one question: what exactly is ".$" in sed '$ s/.$//'? How can it remove the comma if the comma is not mentioned; is it removing the last string? – crocefisso Mar 15 '16 at 11:13
  • @crocefisso, almost correct. It is a regular expression which removes the last character of the last line of input. Have a look [here](http://stackoverflow.com/a/27327973/1617563) – aleneum Mar 15 '16 at 11:27
  • Nice, Thx aleneum! – crocefisso Mar 15 '16 at 12:38
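As a standalone illustration of that sed expression (`$` addresses the last line of input, and `s/.$//` deletes its final character):

$ printf 'line1,\nline2,\n' | sed '$ s/.$//'
line1,
line2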

This can be done with a single command on Linux. From the directory containing all the json files, create a new directory (say "output"), then run

jsonlint -v -f *.json > output/bigfile.json

See the jsonlint source and the jsonlint manual for Ubuntu.

crocefisso
  • Please add an explanation, and state where `jsonlint` can be obtained, and what platforms it is supported on. – mklement0 Mar 19 '16 at 00:15
  • Hopefully it works for thousands of files, but it doesn't work in my case with 100k to millions of files: `argument list too long: jsonlint` – UberMario Jul 16 '18 at 03:11

If you ever need to read a bunch of JSON files into memory as a single object with the filenames as keys and the contents as the corresponding values, consider using jq:

jq -n '[inputs|{(input_filename):.}]|add' FILE...
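For the two sample files from the first answer, the invocation and output would look like this (jq reads each file as one document, wraps it in an object keyed by its filename, and `add` merges those objects):

$ jq -n '[inputs|{(input_filename):.}]|add' file1.json file2.json
{
  "file1.json": {
    "key_1": "value_1.1",
    "key_2": "value_1.2",
    "key_3": "value_1.3",
    "key_n": "value_1.n"
  },
  "file2.json": {
    "key_1": "value_2.1",
    "key_2": "value_2.2",
    "key_3": "value_2.3",
    "key_n": "value_2.n"
  }
}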
peak