
I have a large json file that I would like to split according to the key "metadata". One example of record is

{"text": "The primary outcome of the study was hospital mortality; secondary outcomes included ICU mortality and lengths of stay for hospital and ICU. ICU mortality was defined as survival of a patient at ultimate discharge from the ICU and hospital mortality was defined as survival at discharge or transfer from our hospital.", "label": "conclusion", "metadata": "18982114"}

There are many records in the json file where the key "metadata" is "18982114". How can I extract all of these records and store them in a separate json file? Ideally, I'm looking for a solution that doesn't load and loop over the entire file, otherwise it would be very cumbersome every time I query it. I think it may be doable with shell commands, but unfortunately I'm not an expert in them... so I would highly appreciate a fast, non-looping query solution, thx!

==========================================================================

Here are some samples from the file (it contains 5 records):

{"text": "Finally, after an emergency laparotomy, patients who received i.v. vasoactive drugs within the first 24 h on ICU were 3.9 times more likely to die (OR 3.85; 95% CI, 1.64 -9.02; P\u00bc0.002). No significant prognostic factors were determined by the model on day 2.", "label": "conclusion", "metadata": "18982114"}

{"text": "Kinetics ofA TP Binding to Normal and Myopathic", "label": "conclusion", "metadata": "10700033"}

{"text": "Observed rate constants, k0b,, were obtained by fitting the equation I(t)=oe-kobs+C by the method of moments, where I is the observed fluorescence intensity, and I0 is the amplitude of fluorescence change. 38 ", "label": "conclusion", "metadata": "235564322"}

{"text": "The capabilities of modern angiographic platforms have recently improved substantially.", "label": "conclusion", "metadata": "2877272"}

{"text": "Few studies have concentrated specifically on the outcomes after surgery.", "label": "conclusion", "metadata": "18989842"}

The job is to quickly retrieve the text of the records with metadata "18982114".

oguz ismail
Yingqiang Gao

3 Answers


Use the `json` package to convert the JSON object into a dictionary, then read the value stored under the "metadata" key. Here is a working example:

# importing the module
import json

# Opening JSON file (note: json.load expects the file to hold a single JSON value)
with open('data.json') as json_file:
    data = json.load(json_file)

    # Print the type of data variable
    print("Type:", type(data))

    # Print the value of the metadata key
    print("metadata:", data['metadata'])
Petronella
  • is there a way of doing this without loading and looping over the entire file? This file is about 10 GB, so I was wondering if there are ways of using shell commands to do this... – Yingqiang Gao Mar 16 '21 at 09:59
  • well then what you ask for is not python but shell commands; you can use shell string operations, split the object on the metadata key, use pattern matching, the lot – Petronella Mar 16 '21 at 10:02
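The comment above points at shell tools. Since the posted samples show one JSON object per line (JSON Lines rather than a single JSON array), a plain `grep` can pull out the matching records without Python ever loading the file; a minimal sketch, assuming the file is named `data.json`:

```shell
# Keep only the lines whose metadata field is 18982114;
# each line is a complete JSON object, so the output is itself valid JSON Lines
grep '"metadata": "18982114"' data.json > subset.json
```

Note this matches on the raw text, so it assumes the field is spelled exactly this way (one space after the colon); `jq` gives a more robust structural filter if it is available.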

You can try this approach:

import json

MATCH_META_DATA = '18982114'

with open('data.json') as data_json:
    data = json.load(data_json)

match_records = []
for part_data in data:
    if part_data.get('metadata') == MATCH_META_DATA:
        match_records.append(part_data)
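`json.load` above still reads the whole file into memory at once. If each record sits on its own line, as in the posted samples, the same filter can stream the file line by line instead; a sketch under that assumption (the function name is illustrative):

```python
import json

MATCH_META_DATA = '18982114'

def filter_records(path, metadata=MATCH_META_DATA):
    """Stream one record per line instead of loading the whole file."""
    match_records = []
    with open(path) as data_json:
        for line in data_json:
            line = line.strip()
            if not line:  # skip blank lines between records
                continue
            record = json.loads(line)
            if record.get('metadata') == metadata:
                match_records.append(record)
    return match_records
```

Memory use is bounded by the longest line rather than the file size, so this scales to the 10 GB file mentioned in the comments, although it still scans the whole file on every query.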
Andrei Gurko

Let us imagine we have the following JSON content in example.json:

{
    "1":{"text": "Some text 1.", "label": "xxx", "metadata": "18982114"},
    "2":{"text": "Some text 2.", "label": "yyy", "metadata": "18982114"},
    "3":{"text": "Some text 3.", "label": "zzz", "metadata": "something else"}
}

You can do the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

# 1. read json content from file
my_json = None
with open("example.json", "r") as file:
    my_json = json.load(file)

# 2. filter content
#    you can use a list instead of a new dictionary if you don't want to create a new json file
new_json_data = {}
for record_id in my_json:
    if my_json[record_id]["metadata"] == "18982114":
        new_json_data[record_id] = my_json[record_id]

# 3. write a new json with filtered data
with open("result.json", "w") as file:
    json.dump(new_json_data, file)

This will output the following result.json file:

{"1": {"text": "Some text 1.", "label": "xxx", "metadata": "18982114"}, "2": {"text": "Some text 2.", "label": "yyy", "metadata": "18982114"}}
marc_s
SpaceBurger
  • is it doable without looping over the entire file? I have to query this file many times, and looping over a 10 GB file is cumbersome... – Yingqiang Gao Mar 16 '21 at 10:06
  • you may want to look at `jq` or `jqawk` shell commands (otherwise you would have to implement json stream parsing in python, unless there is a library that can help...); see https://stackoverflow.com/questions/44215931/split-json-into-multiple-files and https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions (see the select operator) – SpaceBurger Mar 16 '21 at 10:19
  • You can get the indexes of the records you want with `jq --stream '. | select(length == 2) | select(.[0][1] == "metadata") | select(.[1] == "18982114") | .[0][0]' example.json` (or `jq '.[] | select(.metadata == "18982114")'` if you are not using streaming). However, I am not sure how you can get the corresponding records with it, maybe combining it with some python. – SpaceBurger Mar 16 '21 at 12:13
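Since the question mentions querying the same file many times, another option is to pay the full scan once and build a small index from each metadata value to the byte offsets of its records; later queries then seek straight to the matching lines. A sketch, again assuming one JSON object per line; `build_index` and `query_records` are illustrative names, not library functions:

```python
import json

def build_index(path):
    """One full pass: map each metadata value to the byte offsets of its lines."""
    index = {}
    offset = 0
    with open(path, 'rb') as f:  # binary mode so offsets are exact byte positions
        for line in f:
            try:
                meta = json.loads(line).get('metadata')
            except json.JSONDecodeError:
                meta = None  # tolerate blank or malformed lines
            if meta is not None:
                index.setdefault(meta, []).append(offset)
            offset += len(line)
    return index

def query_records(path, index, metadata):
    """Seek directly to the recorded offsets; no scan over the file."""
    records = []
    with open(path, 'rb') as f:
        for offset in index.get(metadata, []):
            f.seek(offset)
            records.append(json.loads(f.readline()))
    return records
```

The index itself is small (one offset per record) and could be persisted with `json.dump` or `pickle`, so repeated queries against the 10 GB file avoid rescanning it.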