
I have a large json file that I would like to split according to the key "metadata". One example of record is

{"text": "The primary outcome of the study was hospital mortality; secondary outcomes included ICU mortality and lengths of stay for hospital and ICU. ICU mortality was defined as survival of a patient at ultimate discharge from the ICU and hospital mortality was defined as survival at discharge or transfer from our hospital.", "label": "conclusion", "metadata": "18982114"}

There are many records in the json file where the key "metadata" is "18982114". How can I extract all of these records and store them in a separate json file? Ideally, I'm looking for a solution that doesn't load and loop over the entire file, otherwise it would be very cumbersome every time I query it. I think it may be doable with shell commands, but unfortunately I'm not an expert in them... so I would highly appreciate a fast, non-looping query solution, thx!

==========================================================================

Here are some samples from the file (it contains 5 records):

{"text": "Finally, after an emergency laparotomy, patients who received i.v. vasoactive drugs within the first 24 h on ICU were 3.9 times more likely to die (OR 3.85; 95% CI, 1.64 -9.02; P\u00bc0.002). No significant prognostic factors were determined by the model on day 2.", "label": "conclusion", "metadata": "18982114"}

{"text": "Kinetics ofA TP Binding to Normal and Myopathic", "label": "conclusion", "metadata": "10700033"}

{"text": "Observed rate constants, k0b,, were obtained by fitting the equation I(t)=oe-kobs+C by the method of moments, where I is the observed fluorescence intensity, and I0 is the amplitude of fluorescence change. 38 ", "label": "conclusion", "metadata": "235564322"}

{"text": "The capabilities of modern angiographic platforms have recently improved substantially.", "label": "conclusion", "metadata": "2877272"}

{"text": "Few studies have concentrated specifically on the outcomes after surgery.", "label": "conclusion", "metadata": "18989842"}

The job is to quickly retrieve the text of the records with metadata "18982114".

oguz ismail
Yingqiang Gao

3 Answers


Use the `json` package to convert the JSON object into a dictionary, then read the value stored under the "metadata" key. Here is a working example:

# importing the module
import json

# Opening JSON file (note: json.load expects the file to hold a single JSON value)
with open('data.json') as json_file:
    data = json.load(json_file)

    # Print the type of data variable
    print("Type:", type(data))

    # Print the value of the metadata key
    print("metadata:", data['metadata'])
Petronella
  • is there a way of doing this without loading and looping over the entire file? This file is about 10 GB, so I was wondering if there are ways of using shell commands to do this... – Yingqiang Gao Mar 16 '21 at 09:59
  • well then what you ask for is not python but shell commands; you can use shell string operations, split the object on the metadata key, use pattern matching, the lot – Petronella Mar 16 '21 at 10:02
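The comment above points at shell tools. Since the posted samples show one JSON object per line (JSON Lines rather than a single JSON array), a plain `grep` can pull out the matching records without Python ever loading the file; a minimal sketch, assuming the file is named `data.json`:

```shell
# Keep only the lines whose metadata field is 18982114;
# each line is a complete JSON object, so the output is itself valid JSON Lines
grep '"metadata": "18982114"' data.json > subset.json
```

Note this matches on the raw text, so it assumes the field is spelled exactly this way (one space after the colon); `jq` gives a more robust structural filter if it is available.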

You can try this approach:

import json

MATCH_META_DATA = '18982114'

with open('data.json') as data_json:
    data = json.load(data_json)

match_records = []
for part_data in data:
    if part_data.get('metadata') == MATCH_META_DATA:
        match_records.append(part_data)
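`json.load` above still reads the whole file into memory at once. If each record sits on its own line, as in the posted samples, the same filter can stream the file line by line instead; a sketch under that assumption (the function name is illustrative):

```python
import json

MATCH_META_DATA = '18982114'

def filter_records(path, metadata=MATCH_META_DATA):
    """Stream one record per line instead of loading the whole file."""
    match_records = []
    with open(path) as data_json:
        for line in data_json:
            line = line.strip()
            if not line:  # skip blank lines between records
                continue
            record = json.loads(line)
            if record.get('metadata') == metadata:
                match_records.append(record)
    return match_records
```

Memory use is bounded by the longest line rather than the file size, so this scales to the 10 GB file mentioned in the comments, although it still scans the whole file on every query.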
Andrei Gurko

Let us imagine we have the following JSON content in example.json:

{
    "1":{"text": "Some text 1.", "label": "xxx", "metadata": "18982114"},
    "2":{"text": "Some text 2.", "label": "yyy", "metadata": "18982114"},
    "3":{"text": "Some text 3.", "label": "zzz", "metadata": "something else"}
}

You can do the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

# 1. read json content from file
my_json = None
with open("example.json", "r") as file:
    my_json = json.load(file)

# 2. filter content
#    you can use a list instead of a new dictionary if you don't want to create a new json file
new_json_data = {}
for record_id in my_json:
    if my_json[record_id]["metadata"] == "18982114":
        new_json_data[record_id] = my_json[record_id]

# 3. write a new json with filtered data
with open("result.json", "w") as file:
    json.dump(new_json_data, file)

This will output the following result.json file:

{"1": {"text": "Some text 1.", "label": "xxx", "metadata": "18982114"}, "2": {"text": "Some text 2.", "label": "yyy", "metadata": "18982114"}}
marc_s
SpaceBurger
  • is it doable without looping over the entire file? I have to query this file many times, and looping over a 10 GB file is cumbersome... – Yingqiang Gao Mar 16 '21 at 10:06
  • you may want to look at `jq` or `jqawk` shell commands (otherwise you would have to implement json stream parsing in python, unless there is a library that can help...); see https://stackoverflow.com/questions/44215931/split-json-into-multiple-files and https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions (see the select operator) – SpaceBurger Mar 16 '21 at 10:19
  • You can get the indexes of the records you want with `jq --stream '. | select(length == 2) | select(.[0][1] == "metadata") | select(.[1] == "18982114") | .[0][0]' example.json` (or `jq '.[] | select(.metadata == "18982114")'` if you are not using streaming). However, I am not sure how you can get the corresponding records with it, maybe combining it with some python. – SpaceBurger Mar 16 '21 at 12:13
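Since the question mentions querying the same file many times, another option is to pay the full scan once and build a small index from each metadata value to the byte offsets of its records; later queries then seek straight to the matching lines. A sketch, again assuming one JSON object per line; `build_index` and `query_records` are illustrative names, not library functions:

```python
import json

def build_index(path):
    """One full pass: map each metadata value to the byte offsets of its lines."""
    index = {}
    offset = 0
    with open(path, 'rb') as f:  # binary mode so offsets are exact byte positions
        for line in f:
            try:
                meta = json.loads(line).get('metadata')
            except json.JSONDecodeError:
                meta = None  # tolerate blank or malformed lines
            if meta is not None:
                index.setdefault(meta, []).append(offset)
            offset += len(line)
    return index

def query_records(path, index, metadata):
    """Seek directly to the recorded offsets; no scan over the file."""
    records = []
    with open(path, 'rb') as f:
        for offset in index.get(metadata, []):
            f.seek(offset)
            records.append(json.loads(f.readline()))
    return records
```

The index itself is small (one offset per record) and could be persisted with `json.dump` or `pickle`, so repeated queries against the 10 GB file avoid rescanning it.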