
I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract it into a pandas DataFrame. The snippet below works, but it is fairly inefficient and really takes forever when run against the whole database. Note that not all items have the same attributes and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize)\
       .pipe(lambda x: pd.concat(x.values))
# This returns a DataFrame where each JSON key is a column
jodoox

3 Answers


json_normalize takes an already parsed JSON object (a dict), or a pandas Series/list of such objects, so parse the strings first and make a single call:

pd.io.json.json_normalize(df.data.apply(json.loads))

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])
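
The idea can be sketched with inline sample rows standing in for the pastebin data (hypothetical records, in case the URL is unavailable); note that current pandas spells the function `pd.json_normalize`, while older versions used `pd.io.json.json_normalize`:

```python
import json
import pandas as pd

# Three JSON strings with differing keys and one nested attribute,
# mimicking the shape of the question's data (made-up sample).
raw = ['{"a": 1, "b": {"c": 2}}',
       '{"a": 3}',
       '{"b": {"c": 4}, "d": 5}']
s = pd.Series(raw)

# Parse once, then hand the parsed objects to json_normalize in one call.
# Nested keys come out as dotted columns (b.c); missing keys become NaN.
result = pd.json_normalize(s.apply(json.loads))
print(result)
```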
piRSquared

I think you can first convert the string column data to dicts, then build a list of records from the values, and finally use DataFrame.from_records:

import json
import pandas as pd

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()
print (pd.DataFrame.from_records(a))

Another idea:

df = pd.json_normalize(df['data'].apply(json.loads))
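
The difference between the two ideas can be sketched with inline data (hypothetical records, not the pastebin file): from_records is fast but leaves nested dicts intact, while json_normalize flattens them into dotted columns.

```python
import json
import pandas as pd

raw = ['{"name": "a", "meta": {"x": 1}}',
       '{"name": "b", "meta": {"x": 2}}']
parsed = [json.loads(r) for r in raw]

# from_records keeps each nested dict as a single object cell...
flat = pd.DataFrame.from_records(parsed)
print(flat['meta'].tolist())

# ...while json_normalize expands nested keys into 'meta.x' etc.
norm = pd.json_normalize(parsed)
print(norm.columns.tolist())
```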
jezrael
  • Thanks - that's about 100x faster than my initial approach. The only issue is that this doesn't expand the nested dicts. Would that be possible? – jodoox Dec 18 '16 at 16:35
  • Check another answer ;) – jezrael Dec 18 '16 at 16:37
  • Quick question @jezrael the order of the csv and the df you are making from variable 'a' is the same right ? first record will be first record and second will be second and so on.. will they ever shuffle ? – skybunk May 23 '18 at 14:23
  • 1
    @skybunk - Yes, exactly. There is no reason for `shuffle` – jezrael May 23 '18 at 14:25

data = {"events": [
    {"timemillis": 1563467463580, "date": "18.7.2019",
     "time": "18:31:03,580", "name": "Player is loading", "data": ""},
    {"timemillis": 1563467463668, "date": "18.7.2019",
     "time": "18:31:03,668", "name": "Player is loaded", "data": "5"}
]}

from pandas.io.json import json_normalize
result = json_normalize(data,'events')
print(result)
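
If top-level fields should be carried onto every event row as well, json_normalize's meta parameter handles that; a sketch with an extra hypothetical "source" field added to the same kind of structure:

```python
import pandas as pd

data = {"source": "player-log",   # made-up top-level field
        "events": [
            {"timemillis": 1563467463580, "name": "Player is loading"},
            {"timemillis": 1563467463668, "name": "Player is loaded"}]}

# record_path picks the list to expand; meta repeats the named
# top-level field on every resulting row.
result = pd.json_normalize(data, record_path="events", meta="source")
print(result.columns.tolist())
```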
Madhur Yadav