
I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract it into a pandas DataFrame. The snippet below works, but it is fairly inefficient and really takes forever when run against the whole database. Note that not all items have the same attributes and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize)\
       .pipe(lambda x: pd.concat(x.values))
# This returns a DataFrame where each JSON key is a column
jodoox

3 Answers


json_normalize takes an already parsed JSON object (a dict), or a pandas Series/list of such objects, so parse the strings first and make a single call:

pd.io.json.json_normalize(df.data.apply(json.loads))

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])
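
The idea can be sketched with inline sample rows standing in for the pastebin data (hypothetical records, in case the URL is unavailable); note that current pandas spells the function `pd.json_normalize`, while older versions used `pd.io.json.json_normalize`:

```python
import json
import pandas as pd

# Three JSON strings with differing keys and one nested attribute,
# mimicking the shape of the question's data (made-up sample).
raw = ['{"a": 1, "b": {"c": 2}}',
       '{"a": 3}',
       '{"b": {"c": 4}, "d": 5}']
s = pd.Series(raw)

# Parse once, then hand the parsed objects to json_normalize in one call.
# Nested keys come out as dotted columns (b.c); missing keys become NaN.
result = pd.json_normalize(s.apply(json.loads))
print(result)
```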
piRSquared

I think you can first convert the string column data to dicts, then build a list of records from the values, and finally use DataFrame.from_records:

import json
import pandas as pd

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()
print (pd.DataFrame.from_records(a))

Another idea:

df = pd.json_normalize(df['data'].apply(json.loads))
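
The difference between the two ideas can be sketched with inline data (hypothetical records, not the pastebin file): from_records is fast but leaves nested dicts intact, while json_normalize flattens them into dotted columns.

```python
import json
import pandas as pd

raw = ['{"name": "a", "meta": {"x": 1}}',
       '{"name": "b", "meta": {"x": 2}}']
parsed = [json.loads(r) for r in raw]

# from_records keeps each nested dict as a single object cell...
flat = pd.DataFrame.from_records(parsed)
print(flat['meta'].tolist())

# ...while json_normalize expands nested keys into 'meta.x' etc.
norm = pd.json_normalize(parsed)
print(norm.columns.tolist())
```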
jezrael
  • Thanks - that's about 100x faster than my initial approach. The only issue is that this doesn't expand the nested dicts. Would that be possible? – jodoox Dec 18 '16 at 16:35
  • Check another answer ;) – jezrael Dec 18 '16 at 16:37
  • Quick question @jezrael the order of the csv and the df you are making from variable 'a' is the same right ? first record will be first record and second will be second and so on.. will they ever shuffle ? – skybunk May 23 '18 at 14:23
  • 1
    @skybunk - Yes, exactly. There is no reason for `shuffle` – jezrael May 23 '18 at 14:25

data = {"events": [
    {"timemillis": 1563467463580, "date": "18.7.2019",
     "time": "18:31:03,580", "name": "Player is loading", "data": ""},
    {"timemillis": 1563467463668, "date": "18.7.2019",
     "time": "18:31:03,668", "name": "Player is loaded", "data": "5"}
]}

from pandas.io.json import json_normalize
result = json_normalize(data,'events')
print(result)
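
If top-level fields should be carried onto every event row as well, json_normalize's meta parameter handles that; a sketch with an extra hypothetical "source" field added to the same kind of structure:

```python
import pandas as pd

data = {"source": "player-log",   # made-up top-level field
        "events": [
            {"timemillis": 1563467463580, "name": "Player is loading"},
            {"timemillis": 1563467463668, "name": "Player is loaded"}]}

# record_path picks the list to expand; meta repeats the named
# top-level field on every resulting row.
result = pd.json_normalize(data, record_path="events", meta="source")
print(result.columns.tolist())
```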
Madhur Yadav