How to join json with Pandas

Question

The goal here is to count the number of patients for each type of diagnosis. In patient record, visit id is unique, yet in diagnosis record, since one visit could have multiple diagnosis, the same visit id could have multiple diagnosis id.

To do so, i think the 2 data frame need to be linked with the field visit id. Could anyone kindly shed some light on how to link the 2 json via Pandas and count the number of patients for each diagnosis. Many thanks

Patient record

JSON [Patient record]

[
 {
   "Doctor id":"AU1254",
   "Patient":[
      {
         "Patient id":"BK1221",
         "Patient name":"Tim"
      }
   ],  
   "Visit id":"B0001"       
},
 {
   "Doctor id":"AU8766",
   "Patient":[
      {
         "Patient id":"BK1209",
         "Patient name":"Sue"
      }
   ],  
   "Visit id":"B0002"  
},
 {
   "Doctor id":"AU1254",
   "Patient":[
      {
         "Patient id":"BK1323",
         "Patient name":"Sary"
      }
   ],  
   "Visit id":"B0003"  
  }
]

Diagnosis record

JSON [Diagnosis record]

[
   {
      "Visit id":"B0001",
      "Diagnosis":[
         {
            "diagnosis id":"D1001",
            "diagnosis name":"fever"           
         },
         {
            "diagnosis id":"D1987",
            "diagnosis name":"cough"
         },
         {
             "diagnosis id":"D1265",
            "diagnosis name":"running nose"
         }
      ]
   }, 
      {
      "Visit id":"B0002",
      "Diagnosis":[
         {
            "diagnosis id":"D1987",
            "diagnosis name":"cough"           
         },
         {
            "diagnosis id":"D1453",
            "diagnosis name":"stomach ache"
         }
      ]
   } 
]

Danail Petrov · Answer 1 · 2020-12-28T16:17:39.533

You can use left merge() on visit id. merge

>  from pandas.io.json import json_normalize
>  import json
>  json1 = <your first json here>
>  json2 = <your second json here>
>  df1=pd.json_normalize(data=json.loads(json1), record_path='Patient', meta=['Doctor id','Visit id'])
>  df2=pd.json_normalize(data=json.loads(json2), record_path='Diagnosis', meta=['Visit id'])


>  print(df1.merge(df2, on='Visit id', how='left').dropna())
  Patient id Patient name Doctor id Visit id diagnosis id diagnosis name
0     BK1221          Tim    AU1254    B0001        D1001          fever
1     BK1221          Tim    AU1254    B0001        D1987          cough
2     BK1221          Tim    AU1254    B0001        D1265   running nose
3     BK1209          Sue    AU8766    B0002        D1987          cough
4     BK1209          Sue    AU8766    B0002        D1453   stomach ache

You can also do some fancy grouping / printing

pd.pivot_table(df3, index=['Patient id','Patient name','Doctor id','Visit id'], values=['diagnosis id','diagnosis name'], aggfunc=list)
                                                     diagnosis id                diagnosis name
Patient id Patient name Doctor id Visit id
BK1209     Sue          AU8766    B0002            [D1987, D1453]         [cough, stomach ache]
BK1221     Tim          AU1254    B0001     [D1001, D1987, D1265]  [fever, cough, running nose]

And count per diagnosis / per patient

df3.groupby(['diagnosis id', 'diagnosis name']).agg({'Patient name': [list, 'count']})
                            Patient name
                                    list count
diagnosis id diagnosis name
D1001        fever                 [Tim]     1
D1265        running nose          [Tim]     1
D1453        stomach ache          [Sue]     1
D1987        cough            [Tim, Sue]     2

Hi @Danail, im gettting SyntaxError: EOL while scanning string literal. may i know if json1 is a string enclosed with ' ' — epiphany, Dec 28 '20 at 16:08
Just paste the json content as string and you should be all good. Like this: json1=''' copy->paste ''' — Danail Petrov, Dec 28 '20 at 16:09

Pygirl · Accepted Answer · 2020-12-28T16:03:57.793

1

Try: (x--> JSON [Patient record], y-->JSON [Diagnosis record]

df = pd.DataFrame(x)
df = pd.concat([df.pop('Patient').apply(lambda x: pd.Series(x[0])), df], axis=1)

df1 = pd.DataFrame(y)

df1 = pd.concat([df1.explode('Diagnosis')['Diagnosis'].apply(pd.Series), df1], axis=1)
df1.pop('Diagnosis')

df_merge = pd.merge(df,df1, on='Visit id', how='right')

df_merge:

    Patient id  Patient name  Doctor id Visit id  diagnosis id diagnosis name
0   BK1221      Tim           AU1254    B0001     D1001        fever
1   BK1221      Tim           AU1254    B0001     D1987        cough
2   BK1221      Tim           AU1254    B0001     D1265        running nose
3   BK1209      Sue           AU8766    B0002     D1987        cough
4   BK1209      Sue           AU8766    B0002     D1453        stomach ache

To count:

df_merge.groupby('diagnosis name')['Patient id'].count()

Edit:

Try:

df_merge.groupby('diagnosis name').agg({'Patient name': [list, 'count']}).reset_index()

diagnosis name  Patient name
                list        count
        cough   [Tim, Sue]  2
        fever   [Tim]       1
running nose    [Tim]       1
stomach ache    [Sue]       1

edited Dec 28 '20 at 16:03

answered Dec 28 '20 at 15:40

Pygirl

12,969
5
30
43

this may help: https://stackoverflow.com/questions/50839737/pandas-column-expansion-of-list-of-dictionary-how-to-optimise – Pygirl Dec 28 '20 at 15:50
Hi @Pygirl, for the diagnosis 'cough', isnt the count is 2 – epiphany Dec 28 '20 at 15:51
I have group by id then do it by name. I have updated my answer :) – Pygirl Dec 28 '20 at 15:52
hi, as i would like to count the no of patient for each diagnosis, and for diagnosis cough, the no of patient is 2 which is Tim and Sue,could you kindly advise – epiphany Dec 28 '20 at 15:56
@epiphany: Now check – Pygirl Dec 28 '20 at 16:04
Hi @Pygirl, for your solution, when in the list there's 2 patient with the same name ie [Tim, Tim], it will count 2. How can i change the count to 1, i've tried to use replace count with nunique but still the number of count will include patient with same name. Any idea? thank you – epiphany Jan 06 '21 at 15:32
Actually they are grouped by diagnosis that's why they will have count as you as there will be 2 TIM for 2 different diagnosis – Pygirl Jan 06 '21 at 15:45
okay, got it .if the patient record contains diagnosis of different years, eg 2020 cough [Tim], 2021 cough [Tim],so now in the count column for the cough diagnosis, there'll be 2 Tim. As the goal is to count the number of patient for each diagnosis, the year can be ignore. Is there a way to filter out the patient with the same name in the patient name column, many thanks – epiphany Jan 06 '21 at 15:54
Try: `df_merge.groupby('diagnosis name').agg({'Patient name': [np.unique, 'nunique']}).reset_index()` I replaced `count` with `nunique` and getting the `1` only. – Pygirl Jan 06 '21 at 16:09

score 0 · Answer 3 · answered Dec 28 '20 at 15:50

Try the following for the patient records.

patients_df = pd.read_json(patients.json)

patient_id = []
patient_name =[]

# Get attributes from nested nested datatypes in Patient column
for patient in patients_df['Patients']:
    patient_id = patient[0]['Patient id']
    patient_name = patient[0]['Patient name']

# Add to the pandas dataframe
patients_df['Patient name'] = patient_name
patient_df['Patient id'] = patient_id

# Drop the 'Patient' column
patients_df = patients_df.drop(column='Patient')

How to join json with Pandas

3 Answers3