
I have a huge newline-delimited JSON file, input.json, which looks like this:

{ "name":"a.txt", "content":"...", "other_keys":"..."}
{ "name":"b.txt", "content":"...", "something_else":"..."}
{ "name":"c.txt", "content":"...", "etc":"..."}
...

How can I split it into multiple text files, where the file names are taken from "name" and the file contents are taken from "content"? The other keys can be ignored. I'm currently toying with the jq tool, without luck.

gradusas
  • `jq` can collect the objects with the same name and content, but it doesn't have the ability to open and write to arbitrary files. – chepner Dec 20 '19 at 16:20

3 Answers


The key to an efficient, jq-based solution is to pipe the output of jq (invoked with the -c option) to a program such as awk to perform the actual writing of the output files.

jq -c '.name, .content' input.json | 
  awk 'fn {print > fn; close(fn); fn=""; next}     # second line of each pair: the (still JSON-encoded) content, written to the file
       {fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn)}'  # first line of each pair: the name, with the surrounding JSON quotes stripped

Warnings

Blindly relying on the JSON input for the file names has some risks, e.g.

  • what if the same "name" is specified more than once?
  • if a file already exists, the above program will simply append to it.

Also, somewhere along the line, the validity of .name as a filename should be checked.
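A minimal sketch of such a validity check in Python (the exact rules below are assumptions; tighten them for your environment):

```python
import os.path

def safe_name(name):
    """Return True if `name` looks safe to use as a file name in the current directory."""
    return (
        bool(name)
        and name == os.path.basename(name)  # rejects '/' and any directory components
        and name not in (".", "..")
        and not name.startswith("-")        # avoids names that look like command options
    )

print(safe_name("a.txt"))          # True
print(safe_name("../etc/passwd"))  # False: path traversal
```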

Related answers on SO

This question has been asked and answered on SO in slightly different forms before, see e.g. Split a JSON file into separate files

peak

jq doesn't have the output capabilities to create the desired files after grouping the objects; you'll need to use another language with a JSON library. An example using Python:

import json
import fileinput

for line in fileinput.input():  # Read from standard input or filename arguments
    d = json.loads(line)
    with open(d['name'], "a") as f:
        print(d['content'], file=f)

This has the drawback of repeatedly opening and closing each file multiple times, but it's simple. A more complex, but more efficient, example would use an exit stack context manager.

import json
import fileinput
import contextlib

with contextlib.ExitStack() as es:
    files = {}
    for line in fileinput.input():
        d = json.loads(line)
        file_name = d['name']
        if file_name not in files:
            files[file_name] = es.enter_context(open(file_name, "w"))
        print(d['content'], file=files[file_name])

Put briefly, files are opened and cached as they are discovered. Once the loop completes (or in the event of an exception), the exit stack ensures all files previously opened are properly closed.

If there's a chance that there will be too many files to have open simultaneously, you'll have to use the simple-but-inefficient code, though you could implement something even more complex that just keeps a small, fixed number of files open at any given time, reopening them in append mode as necessary. Implementing that is beyond the scope of this answer, though.
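For illustration only (this is not from the original answer), such a bounded cache of open files could be sketched as follows; the `MAX_OPEN` limit and the LRU eviction policy are assumptions:

```python
from collections import OrderedDict

MAX_OPEN = 8  # assumed limit; in practice, keep it below your file-descriptor budget

class FileCache:
    """Keep at most `limit` files open, evicting the least recently used.

    Evicted files are reopened in append mode the next time they are needed, so
    (like the simple version above) pre-existing files are appended to, not truncated.
    """
    def __init__(self, limit=MAX_OPEN):
        self.limit = limit
        self._open = OrderedDict()  # name -> file object, least recently used first

    def get(self, name):
        if name in self._open:
            self._open.move_to_end(name)  # mark as most recently used
        else:
            if len(self._open) >= self.limit:
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            self._open[name] = open(name, "a")
        return self._open[name]

    def close_all(self):
        for f in self._open.values():
            f.close()
        self._open.clear()
```

The loop from the simple version would then call `cache.get(d['name'])` instead of `open(...)`, and call `cache.close_all()` in a `finally` block when done.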

chepner

The following jq-based solution ensures that the content written to the output files is pretty-printed, but ignores any input object whose .content is equal to the JSON string "IGNORE ME":

jq 'if .content == "IGNORE ME" 
    then "Skipping IGNORE ME" | stderr | empty     # log to stderr, emit nothing for this object
    else .name, .content, "IGNORE ME" end' input.json |
    awk '/^"IGNORE ME"$/ {close(fn); fn=""; next}  # the sentinel marks the end of one file
         fn {print >> fn; next}                    # append content lines to the current file
         {fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn)}' # otherwise: a name line; strip the quotes
peak