
I have several (~300,000) files of individual JSON objects that I want to combine into a single file that is a JSON array. How can I do this on Linux, assuming they are all in the location "~/data_files"?

FileA

{
  name: "Test",
  age: 23
}

FileB

{
  name: "Foo",
  age: 5
}

FileC

{
  name: "Bar",
  age: 5
}

Example output (begins and ends with brackets, with commas added between objects):

[
    {
      name: "Test",
      age: 23
    },
    {
      name: "Foo",
      age: 5
    },
    {
      name: "Bar",
      age: 5
    }
]

What I've tried:

I know I can use cat to combine a bunch of files. I'm not sure how to do it for all files in a directory yet, but I'm trying to figure that out. I'm also trying to figure out how to put the `,` between the files I'm concatenating; I haven't seen a command for that yet.

Andrey
Don P

5 Answers


Since you seem a little new to Unix, I'll try to give you a solution that is simple and doesn't introduce too many new concepts; I'll leave clever and novel to the other posters. This solution is efficient, since all it does is stream files into files.

To start with, we create a new file in our home directory containing just the opening square bracket.
echo "[" > ~/tmp.json

Now we loop through all the files in your data_files directory and append them to our new file. The `>>` appends to what's already there; a single `>` would overwrite the file each time. The `echo` adds a comma after `cat` has finished outputting each file.
for i in ~/data_files/*; do cat "$i"; echo ","; done >> ~/tmp.json

So now we have your 300k files in one file called tmp.json, with each entry separated by a comma; but the last line of the file is also a comma, and that is not what we want.
The sed command below behaves like cat, except that '$d' tells it to omit the last line of the file.
So we create a new file with all but the last line of our temporary file.
sed '$d' ~/tmp.json > ~/finished.json

We need to close our square bracket:
echo "]" >> ~/finished.json

And finally we delete our temporary file:
rm ~/tmp.json

And we are done.

[
{
    name: "Test",
    age: 23
}
,
{
    name: "Foo",
    age: 5
}
,
{
    name: "Bar",
    age: 5
}
]
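Put together, the whole sequence looks like this (a sketch using the paths from the question):

```shell
# start the array
echo "[" > ~/tmp.json

# append each file, followed by a comma on its own line
for i in ~/data_files/*; do
  cat "$i"
  echo ","
done >> ~/tmp.json

# drop the final comma line and close the array
sed '$d' ~/tmp.json > ~/finished.json
echo "]" >> ~/finished.json
rm ~/tmp.json
```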

A quick glance at this post about pretty-printing JSON will point you at a command-line tool that will take your finished.json file and turn it into exactly the output you asked for.
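For example, assuming the objects inside are valid JSON (keys quoted), either of these will pretty-print the result; `jq` is an assumption here, not something the question requires:

```shell
python -m json.tool ~/finished.json   # Python's built-in pretty-printer
jq . ~/finished.json                  # or jq, if installed
```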

Niall Cosgrove
  • Hi @niall-cosgrove, I tried running my adapted version of this line `for i in ~/data_files/*; do cat $i;echo ","; done >> ~/tmp.json` and it just ends up creating an infinite loop with a never-ending file. Here's my adapted version: `for f in *.txt; do cat $f;echo ","; done >> masterDoc.txt` Any ideas why this is looping? – GPP Jun 27 '20 at 04:18
  • @GPP looks like `masterDoc.txt` is in the same directory as all your other text files, so I think when the for loop gets that far you are appending its contents to itself forever. If you call it anything that doesn't end in `.txt` you should be fine, e.g. `for f in *.txt; do cat $f;echo ","; done >> masterDoc`. That way `f` is never matched with your output file. – Niall Cosgrove Jun 27 '20 at 20:58
  • Thanks @NiallCosgrove, just confirming that your answer was the solution to the problem. Thanks! – GPP Jul 08 '20 at 18:16

A simple for loop and a couple of sed commands will do:

$ echo "[" > all; 
  for f in file{A,B,C}; 
  do 
     sed 's/^/\t/;$s/$/,/' "$f" >> all; 
  done; 
  sed -i '$s/,/\n]/' all

$ cat all
[
 {
   name: "Test",
   age: 23
 },
 {
   name: "Foo",
   age: 5
 },
 {
   name: "Bar",
   age: 5
 }
]

or the same to stdout

$ echo "["; for f in file{A,B,C}; do sed 's/^/\t/;$s/$/,/' "$f"; done |
  sed '$s/,/\n]/'

To run it for all files in the directory, change `file{A,B,C}` to `*`.
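For the full directory from the question, the same idea becomes the following sketch (GNU sed is assumed, for `-i`, `\t`, and the `\n` in the replacement):

```shell
echo "[" > all
for f in ~/data_files/*; do
  # indent each line; append a comma to the last line of each object
  sed 's/^/\t/;$s/$/,/' "$f" >> all
done
# turn the final trailing comma into the closing bracket
sed -i '$s/,/\n]/' all
```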

karakfa

This script should work even if the number of files is 300K+. It should also be faster than the sed solution, since the file contents are streamed through unmodified rather than processed line by line.

#!/bin/sh
tmp="/dev/shm/${USER}.find.tmp"
out='all.json'
# list the input files once, in a temporary file
find . -maxdepth 1 -name 'file*' > "${tmp}"
echo '[' > "${out}"
# every file except the last is followed by a comma
for f in $(head -n -1 "${tmp}")
do
  cat "${f}" >> "${out}"
  echo ',' >> "${out}"
done
# the last file closes the array without a trailing comma
f=$(tail -n 1 "${tmp}")
cat "${f}" >> "${out}"
echo ']' >> "${out}"
rm -f -- "${tmp}"
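Note that the loop word-splits the output of `find`, so this breaks on filenames containing spaces. A variant sketch that side-steps both that and the head/tail bookkeeping is to print the comma before every object except the first:

```shell
#!/bin/sh
out='all.json'
echo '[' > "${out}"
first=1
for f in ~/data_files/*; do
  # emit a separating comma before every object except the first
  if [ "${first}" -eq 1 ]; then first=0; else echo ',' >> "${out}"; fi
  cat "${f}" >> "${out}"
done
echo ']' >> "${out}"
```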
Andrey

And a Python version for completeness (printing the comma before every object except the first, so no trailing comma is left after the last one):

import os
import sys

path = sys.argv[1]

print("[")
sep = ""
for fn in os.listdir(path):
    with open(os.path.join(path, fn)) as f:
        sys.stdout.write(sep)
        sys.stdout.write(f.read())
    sep = ","
print("]")
Vadim Key

Just use jq; it is, or should be, best practice at this point.

$ cat <<eof | jq -s
> { "key": 1 }
> { "key2": 2 }
> { "key3": 3 }
> eof
[
  {
    "key": 1
  },
  {
    "key2": 2
  },
  {
    "key3": 3
  }
]

If your requirements are JUST to push JSON objects into a queue, any other suggestion is naive at best, and that is not a statement based on opinion.
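For the directory in the question, a sketch of the same idea; `find -exec cat {} +` streams all the files, which avoids the shell's argument-length limit that `cat ~/data_files/*` could hit with 300k files:

```shell
# slurp every file in ~/data_files into one JSON array
find ~/data_files -type f -exec cat {} + | jq -s .
```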

christian elsee