3

I have a nested JSON file of about 180 MB with up to 280,000 entries. The data looks like this:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}, 
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
  ],
"annotations": [
    {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
    {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
    {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
    {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

Note that in the real file all the JSON data is on one line; I have broken it across several lines here for readability.

My question is: how can I split this JSON file into smaller files, or even just two files? The file is nested, with two top-level keys, images and annotations. Each output file should keep the same hierarchy as above, meaning that an image and the annotation with the same ID must be stored together in one file.

For example, the JSON data above has 4 entries for images and 4 entries for annotations. After splitting it into two files, each new file should contain 2 image entries and 2 annotation entries, as shown below.

JSON file_1 data:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
     {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
  ]
}

JSON file_2 data:

{ 
"images": [
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
     {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

I checked many questions on Stack Overflow and GitHub but was unable to solve my problem. Some solutions exist, but not for nested JSON data.

Here is json-splitter on GitHub, but it does not work for nested JSON.

Another question on Stack Overflow could work, but only for small files, because it is very difficult to specify IDs or data in order to delete entries one by one.

I tried the code below from this GitHub post:

import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)
    chunkSize = 4550
    for i in range(0, len(o), chunkSize):
        with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
            json.dump(o[i:i+chunkSize], outfile)

but again it doesn't solve my problem. What am I missing? I know there are many questions and answers about this problem, but none of the solutions works in my case because of the nested data. I'm new to Python, and after a lot of work I'm still unable to solve this. Looking for suggestions and solutions. Thanks.

Erric
  • `My question is that how I can split or divide this json file data into small files or even two files?` Which one is it? And also, can you please tell us how you want to divide it? You can use the `key`-iterator to iterate through the keys. – Swedgin Oct 14 '20 at 13:19
  • Some reading material: https://realpython.com/iterate-through-dictionary-python/ – Swedgin Oct 14 '20 at 13:21
  • @Swedgin thanks for the helpful material, I will read it. As I described in my question above: is there any way to split a lengthy nested JSON file into smaller files? In the answer below, @Contrean's idea is similar, in that he split nested data but I want to split my json file with nested data. You could say 50% of the `images` together with 50% of the `annotations` data. I hope you get my point. – Erric Oct 15 '20 at 01:35
  • Sorry Erric, I don't get your point. According to your comment, you want to go from 1 file to 4 files? 50% images, 50% images, 50% annotations, 50% annotations? This `he split nested data but I want to split my json file with nested data` is weird, I don't get what you are trying to do. I propose you make a limited JSON object (like only 2-4 items in images and annotations) and show how you want to split it. (And use proper indentation for the JSON object.) – Swedgin Oct 15 '20 at 08:37
  • @Swedgin thanks for your suggestions. I have updated my question with more detail and an example. I hope it shows my point clearly. – Erric Oct 15 '20 at 10:15
  • Ah, that's clearer. Can we assume that the JSON data is correct? As in, all ids are present under both keys, both lists have equal length, etc. Or do you need checks as well? – Swedgin Oct 15 '20 at 10:51
  • @Swedgin this data is just an example; my main issue is how to split it. You can assume the data is correct, as I copied it from the file. – Erric Oct 16 '20 at 02:00

2 Answers

2

The code below will do the split for you.

import json

d = {
    "images": [
        {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 5, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 7, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 9, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 99, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"}
    ],
    "annotations": [{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 5, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 7, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 9, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 99, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
                    ]
}

NUM_OF_ENTRIES_IN_FILE = 2
# Assumes the images and annotations lists are sorted the same way,
# so that annotations[i] belongs to images[i].
# range() steps through the start index of each chunk; the last chunk
# may be shorter, and the output files are numbered without gaps.
for counter, start in enumerate(range(0, len(d['images']), NUM_OF_ENTRIES_IN_FILE)):
    end = start + NUM_OF_ENTRIES_IN_FILE
    temp = {'images': d['images'][start:end],
            'annotations': d['annotations'][start:end]}
    with open(f'out_{counter}.json', 'w') as f:
        json.dump(temp, f)
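
If the two lists cannot be assumed to line up by position (in the sample dictionary above, several annotations point at the same image_id), one alternative is to chunk only the images and pull in every annotation whose image_id matches. The following is a minimal sketch under that assumption; the function name split_by_image and its parameters are hypothetical, and annotations whose image_id matches no image are left out.

```python
from collections import defaultdict


def split_by_image(data, images_per_file):
    """Return a list of {'images': [...], 'annotations': [...]} chunks.

    Annotations are matched to images through their 'image_id' field,
    not through their position in the list (an assumption about the data).
    """
    # Index the annotations by the image they point to.
    by_image = defaultdict(list)
    for ann in data["annotations"]:
        by_image[ann["image_id"]].append(ann)

    chunks = []
    images = data["images"]
    for start in range(0, len(images), images_per_file):
        part = images[start:start + images_per_file]
        # Collect the annotations of every image in this chunk.
        anns = [a for img in part for a in by_image.get(img["id"], [])]
        chunks.append({"images": part, "annotations": anns})
    return chunks
```

Each returned dictionary can then be written with json.dump, for example in a loop over enumerate(split_by_image(d, 2)).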
balderman
  • While code-only answers might answer the question, you could significantly improve the quality of your answer by providing context for your code, a reason for why this code works, and some references to documentation for further reading. From [answer]: _"Brevity is acceptable, but fuller explanations are better."_ – Pranav Hosangadi Oct 14 '20 at 18:19
  • @balderman I don't understand why you added `NUM_OF_ENTRIES_IN_FILE` and `counter`. When I run this code I get 3 JSON files with the same data. Furthermore, your idea is to create the new JSON file manually, i.e. I need to copy some data from the original file and then use it in a new file... Thanks for the idea; I used it to create a new test file, as I only want some data from the original file to save processing time etc... This code is enough: `temp = {'images': d['images'], 'annotations': d['annotations']}` then just add `with open(f"file.json", "w")......` – Erric Oct 15 '20 at 07:09
  • The data in the 3 files is not the same. Look at the id. – balderman Oct 15 '20 at 07:19
1

I added the print statements so you know which step the code has reached, since it probably takes some time to execute.

import json

print("start")

with open("YOURFILE.json", "r") as f:
    data = json.load(f)

print("loaded")

with open("images.json", "w") as f:
    json.dump(data["images"], f)

print("copied images")

with open("annotations.json", "w") as f:
    json.dump(data["annotations"], f)

print("finished")
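
To keep both keys in every output file, as the question asks, the same load step can be followed by a positional slice of each list. This is only a rough sketch, assuming annotations[i] belongs to images[i]; split_file, n_parts and out_dir are illustrative names that do not appear in the question, and the whole file is still loaded into memory as above.

```python
import json
import math
from pathlib import Path


def split_file(src, n_parts, out_dir="."):
    """Split src into up to n_parts files, each with 'images' and 'annotations'."""
    with open(src, "r") as f:
        data = json.load(f)

    images = data["images"]
    annotations = data["annotations"]
    # Assumption: annotations[i] describes images[i].
    # Round up so that no more than n_parts files are produced.
    size = max(1, math.ceil(len(images) / n_parts))

    written = []
    for part, start in enumerate(range(0, len(images), size)):
        chunk = {"images": images[start:start + size],
                 "annotations": annotations[start:start + size]}
        out_path = Path(out_dir) / f"{Path(src).stem}_{part}.json"
        with open(out_path, "w") as f:
            json.dump(chunk, f)
        written.append(out_path)
    return written
```

Calling split_file("YOURFILE.json", 2) would then write YOURFILE_0.json and YOURFILE_1.json into the current directory, each holding half of the images and half of the annotations.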
Contrean
  • This is a nice idea for separating the nested data into single files, but as I mentioned above, I want to split the data while keeping the nested structure in each new file. You could say 50% of the `images` together with 50% of the `annotations` data in each new file. – Erric Oct 15 '20 at 01:44