Extract JSON arrays into dataframe columns

Question

I have project.json file, which contains data like this :

{"student_id": "ST0001", "project": [{"subject_id": "S003", "date_of_submission": "2021-05-23 20:03:05"}, {"subject_id": "S004", "date_of_submission": "2021-05-24 21:03:05"}, {"subject_id": "S005", "date_of_submission": "2021-05-30 05:09:30"}], "project_year": "Second"}
{"student_id": "ST0002", "project": [{"subject_id": "S003", "date_of_submission": "2021-06-02 15:05:05"}, {"subject_id": "S007", "date_of_submission": "2021-04-28 21:03:01"}], "project_year": "Second"}
{"student_id": "ST0002", "project": [{"subject_id": "S0018", "date_of_submission": "2020-06-03 08:15:21"}], "project_year": "First"}

I need to extract the nested subject_id and date_of_submission into a separate column like :

student_id	subject_id	date_of_submission	project_year
ST0001	S003	23/05/2021 20:03	Second
ST0001	S004	24/05/2021 21:03	Second
ST0001	S005	30/05/2021 05:09	Second
ST0002	S003	02/06/2021 15:05	Second
ST0002	S007	28/04/2021 21:03	Second
ST0002	S0018	03/06/2020 08:15	First

I think we can use json_normalize to extract one level up, can someone help me complete this;

import pandas as pd

df=pd.read_json('project.json', lines=True)

df = pd.DataFrame(df).explode('project')

did u check https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json — Epsi95, Jul 31 '21 at 15:16

Anurag Dabas · Answer 1 · 2021-08-01T12:19:35.307

0

you can try with record_path and meta parameters in json_normailze() method:

s=pd.read_json('project.json',lines=True).melt()['value'].tolist()
df=pd.json_normalize(s,record_path=['project'],meta=['student_id','project_year'])
#here data is your json data

output of df:

subject_id  date_of_submission  student_id  project_year
0   S003    2021-05-23 20:03:05     ST0001  Second
1   S004    2021-05-24 21:03:05     ST0001  Second
2   S005    2021-05-30 05:09:30     ST0001  Second
3   S003    2021-06-02 15:05:05     ST0002  Second
4   S007    2021-04-28 21:03:01     ST0002  Second
5   S0018   2020-06-03 08:15:21     ST0002  First

edited Aug 01 '21 at 12:19

answered Jul 31 '21 at 15:21

Anurag Dabas

23,866
9
21
41

I ran your answer changing `data` with `df`, it throws an error, `TypeError: 'int' object is not subscriptable`. Can you please let me know where its incorrect? – MN Af Jul 31 '21 at 17:12
@MNAf for the given sample json it is working...so sorry I can't tell without the data – Anurag Dabas Aug 01 '21 at 04:54
@MNAf we read json file like this `pd.read_json()`......updated answer hope it gets clear **:)** – Anurag Dabas Aug 01 '21 at 08:08
user:14289892 I have appended multiple json files, they don't have a square bracket as in your `data` variable. Is there a way to do using `df=pd.read_json('project.json', lines=True)` – MN Af Aug 01 '21 at 11:11
I am still getting an error, I slightly tweaked your answer and it worked.```read_json_to_df = pd.read_json('project.json', lines=True) json_struct = json.loads(read_json_to_df.to_json(orient="records")) df=pd.json_normalize(json_struct,record_path=['project'],meta=['student_id','project_year']) df``` – MN Af Aug 01 '21 at 12:11
@MNAf if you read correctly I am reading it like `pd.read_json('project.json',typ = 'series').tolist()` – Anurag Dabas Aug 01 '21 at 12:13
I get a `ValueError: Trailing data`, I believe the reason is there is more than one row. – MN Af Aug 01 '21 at 12:16
@MNAf updated answer...kindly have a look and see if it works or not **:)** – Anurag Dabas Aug 01 '21 at 12:20

Extract JSON arrays into dataframe columns

1 Answers1