2

I have a large json file that contains thousands of documents:

[
    {
        "_id": "document1",
        "fields": [ ... ]
    },
    {
        "_id": "document2",
        "fields": [ ... ]
    },
    ...
]

I'd like to split this json file so that each json file contains a single document, and name them accordingly:

document1.json, document2.json, ...

For example, document1.json will contain:

{
    "_id": "document1",
    "fields": [ ... ]
}

I have no knowledge of the jq API, and I'm struggling to find an answer (I've found a similar question, but it is slightly different :( )

Caladbolgll
  • Are you familiar with any programming languages such as PHP, etc? – kojow7 Oct 02 '17 at 22:23
  • @kojow7 I'm not familiar with the languages relevant to JavaScript and web applications. My domain is based on Python, MATLAB and C++. – Caladbolgll Oct 03 '17 at 16:15
  • I have added a pseudo-code answer for you below. If you do end up trying out some language-specific code and still cannot get it to work, add it to your answer above, and then leave me a comment letting me know that your question has changed. – kojow7 Oct 03 '17 at 17:05
  • If the file is small enough to put in memory, you can first convert to JSON Lines format and then split using https://www.convertcsv.com/text-split.htm – dataman Mar 10 '22 at 17:06

3 Answers

4

Here is a Python solution to your problem.

Don't forget to change the in_file_path to the location of your big JSON file.

import json

in_file_path='path/to/file.json' # Change me!

with open(in_file_path,'r') as in_json_file:

    # Read the file and parse it into a list of dictionaries
    json_obj_list = json.load(in_json_file)

    for json_obj in json_obj_list:
        filename=json_obj['_id']+'.json'

        with open(filename, 'w') as out_json_file:
            # Save each obj to their respective filepath
            # with pretty formatting thanks to `indent=4`
            json.dump(json_obj, out_json_file, indent=4)

Side note: I ran this in Python 3; it should work in Python 2 as well.
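The script above takes it on trust that every `_id` is usable as a file name, and it writes into the current directory. A minimal sketch of a variant that writes into a separate folder and replaces characters that are awkward in file names follows. The names `safe_name`, `split_documents` and `out_dir` are hypothetical and not part of the answer above, and the set of allowed characters is an assumption to adjust as needed.

```python
import json
import re
from pathlib import Path


def safe_name(doc_id):
    """Replace anything other than letters, digits, '.', '_' or '-' with '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", str(doc_id))


def split_documents(in_file_path, out_dir):
    """Write each document of a top-level JSON array to <out_dir>/<_id>.json."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    with open(in_file_path, "r", encoding="utf-8") as in_json_file:
        documents = json.load(in_json_file)  # assumed to be a list of dicts

    written = []
    for document in documents:
        target = out_path / (safe_name(document["_id"]) + ".json")
        with open(target, "w", encoding="utf-8") as out_json_file:
            json.dump(document, out_json_file, indent=4)
        written.append(target)
    return written
```

Note that two different `_id` values could map to the same sanitized name, in which case the later document would overwrite the earlier one; if that can happen in your data, check for it before writing.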

Stefan Collier
1

I ran into this problem today as well, and did some research. Just want to share the resulting Python snippet that lets you also customise the length of split files (thanks to this slicing method).

import os
import json
from itertools import islice

def split_json(
    data_path,
    file_name,
    size_split=1000,
):
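    """Split a big JSON file into chunks of at most size_split entries.

    The top level of the file must be a JSON object (a dict).
    data_path : str, folder containing the file, e.g. "data_folder"
    file_name : str, file name without ".json", e.g. "data_file"
    """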
    with open(os.path.join(data_path, file_name + ".json"), "r") as f:
        whole_file = json.load(f)

    split = len(whole_file) // size_split  # number of full chunks

    for i in range(split + 1):
        with open(os.path.join(data_path, file_name + "_"+ str(split+1) + "_" + str(i+1) + ".json"), 'w') as f:
            json.dump(dict(islice(whole_file.items(), i*size_split, (i+1)*size_split)), f)
    return

Update: Then, when you need to combine them together again, use the following code:

import os
import json

json_all = dict()
split = 4         # total number of split files (1-based count)

for i in range(1, split+1):
    with open(os.path.join("data_folder", "data_file_" + str(split) + "_" + str(i) + ".json"), 'r') as f:
        json_i = json.load(f)
        json_all.update(json_i)
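The split function above slices `whole_file.items()`, so it expects the top level of the JSON file to be an object (a dict). For a top-level array, such as the one in the question, the same idea can be sketched with plain list slicing. The names `split_json_list` and `chunk_size` are hypothetical, and the output file naming only mirrors the pattern used above for illustration.

```python
import json
import math
import os


def split_json_list(data_path, file_name, chunk_size=1000):
    """Split a JSON file holding a top-level array into files of at most chunk_size items."""
    with open(os.path.join(data_path, file_name + ".json"), "r") as f:
        items = json.load(f)  # assumed to be a list

    # Round up so a partial last chunk gets a file, but no empty file is written.
    n_chunks = math.ceil(len(items) / chunk_size)

    paths = []
    for i in range(n_chunks):
        chunk = items[i * chunk_size:(i + 1) * chunk_size]
        out_path = os.path.join(data_path, f"{file_name}_{n_chunks}_{i + 1}.json")
        with open(out_path, "w") as f:
            json.dump(chunk, f)
        paths.append(out_path)
    return paths
```

To put the pieces back together, load each file in order and `extend` a single list with its contents.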
loikein
-2

While it is true that the JS in JSON stands for JavaScript, JSON is not dependent on JavaScript or other specific programming languages. Most modern programming languages are able to read a JSON file. So what you'll need to do is the following:

  1. Choose the language you are most comfortable with (e.g. Python)
  2. Read the large JSON file
  3. Convert the JSON file to an object specific to your programming language (steps 2 and 3 may be combined depending on your language choice from step 1)
  4. Loop through each object in the array
  5. In your loop create a new file with the specified filename.
  6. Also in your loop, save the data from that object to the file
  7. Also in your loop, close the file (steps 5 to 7 may be combined depending on your language)

This is a generic algorithm; the details depend on the language that you choose in step 1. If you still cannot get it to work after trying it in a specific language, add your language-specific code to your question above and we can help you further.
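As one possible reading of the steps above in Python, here is a minimal sketch with each step marked in a comment. The function name `split_by_key` and the parameters `out_dir` and `key` are hypothetical; it assumes the file holds a top-level array of objects and that `out_dir` already exists.

```python
import json
import os


def split_by_key(big_json_path, out_dir=".", key="_id"):
    # Steps 2 and 3: read the file and parse it into Python objects.
    with open(big_json_path, "r") as big_file:
        documents = json.load(big_file)

    names = []
    # Step 4: loop through each object in the array.
    for document in documents:
        # Step 5: create a new file named after the object's key.
        name = str(document[key]) + ".json"
        with open(os.path.join(out_dir, name), "w") as small_file:
            # Step 6: save that object's data to the file.
            json.dump(document, small_file, indent=4)
        # Step 7: the with block has closed the file at this point.
        names.append(name)
    return names
```

In Python, steps 5 to 7 collapse into a single `with` block, which is the kind of combination the list mentions.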

E_net4
kojow7