I have a nested JSON file of about 180MB with up to 280000 entries. My JSON data looks like this:
{
"images": [
{"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"},
{"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"},
{"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"},
{"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
{"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
]
}
Note that in the actual file all of the JSON data is on a single line; I have formatted it across multiple lines here for readability.
My question is: how can I split or divide this JSON data into smaller files, or even just two files? My JSON is nested, with two main categories, "images" and "annotations". The divided files must keep the same hierarchy as above, meaning the "images" and "annotations" entries with the same id must be stored together in the same file.
For example: the JSON data above has 4 entries under "images" and 4 entries under "annotations". After splitting it into two files, each newly generated file should contain 2 "images" entries and the 2 matching "annotations" entries, as shown below (a rough sketch of the splitting logic I have in mind follows the two example files).
JSON file_1 data:
{
"images": [
{"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"},
{"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
]
}
JSON file_2 data:
{
"images": [
{"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"},
{"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
{"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
]
}
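To make the pairing requirement concrete, here is a rough sketch of the kind of logic I have in mind (assuming the "images" and "annotations" lists are parallel, i.e. the entry at each index in both lists refers to the same id, and using "data.json" / "data_part_N.json" as placeholder file names). I'm not sure whether this is correct or efficient for a 180MB file:
import json

# sketch only: assumes "images" and "annotations" are parallel lists,
# so entry i in both lists belongs to the same id
with open("data.json", "r") as f:  # placeholder input file name
    data = json.load(f)

chunk_size = 2  # entries per output file; 2 matches the example above
images = data["images"]
annotations = data["annotations"]

for n, start in enumerate(range(0, len(images), chunk_size)):
    part = {
        "images": images[start:start + chunk_size],
        "annotations": annotations[start:start + chunk_size],
    }
    with open("data_part_{}.json".format(n + 1), "w") as out:  # placeholder output name
        json.dump(part, out)
With chunk_size = 2 this should produce exactly the two files shown above, but I don't know whether this approach is the right way to handle 280000 entries.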
I checked many questions on Stack Overflow and GitHub but was unable to solve my problem. Some solutions exist, but not for nested JSON data.
There is json-splitter on GitHub, but it does not work for nested JSON.
Another question on Stack Overflow has an approach that can work, but only for small files, because it is very tedious to specify the IDs (or data) of the entries to delete one by one.
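To illustrate what I mean by deleting entries one by one: that approach boils down to something like the following (my own hypothetical example, not the code from that answer), where the set of ids has to be typed out by hand for every output file, which is not practical for 280000 entries:
import json

with open("data.json", "r") as f:  # placeholder file name
    data = json.load(f)

# the ids to drop would have to be listed manually for every split
remove_ids = {2, 3}
part = {
    "images": [img for img in data["images"] if img["id"] not in remove_ids],
    "annotations": [ann for ann in data["annotations"] if ann["image_id"] not in remove_ids],
}
with open("part_1.json", "w") as out:  # placeholder output name
    json.dump(part, out)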
I tried the code below from a GitHub post:
import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)

chunkSize = 4550
# slices the top-level JSON value, so this only works when that value is a list
for i in range(0, len(o), chunkSize):
    with open(sys.argv[1] + '_' + str(i // chunkSize) + '.json', 'w') as outfile:
        json.dump(o[i:i + chunkSize], outfile)
but again it does not solve my problem, because the top level of my file is a dict with "images" and "annotations" keys, not a list that can be sliced. Where am I going wrong? I know there are many questions and answers about this problem, but none of the solutions work in my case because of the nested data. I'm new to Python, and after a lot of work I'm still unable to solve this. Looking for valuable suggestions and solutions. Thanks.