2

I have the following data

{ "results": [
    {
        "company": "XYZ",
        "createdAt": "2014-03-27T23:21:48.758Z",
        "email": "abc@gmail.com",
        "firstName": "abc",
        "lastName": "xyz",
        "linkedinAccount": "",
        "location": "",
        "profilePicture": {
            "__type": "File",
            "name": "ab0e-profilePicture",
            "url": "url.url.com"
        },
        "registrationGate": "normal",
        "telephone": "",
        "title": "AA",
        "updatedAt": "2014-03-27T23:24:20.220Z",
        "username": "abc@gmail.com",
        "zipcode": "00000"
    } 
    ] 
    }

I import the json data using the following code

import json
import pandas as pd

from pandas import DataFrame
json_data = pd.read_json('data.json')

print json_data[:2]

This prints

results
0  {u'linkedinAccount': u'', u'username': u'abc...
1  {u'linkedinAccount': u'zxcflcnv', u'username...

[2 rows x 1 columns]

when I try to print a column using

print df['linkedinAccount']

I get the following error

KeyError: u'no item named linkedinAccount'

How do I access data in the dataframes based on column names?

CT Zhu
  • 52,648
  • 17
  • 120
  • 133
user3509031
  • 37
  • 1
  • 5
  • Possible dupe: http://stackoverflow.com/questions/21494030/create-a-pandas-dataframe-from-deeply-nested-json – U2EF1 Apr 22 '14 at 00:52
  • @U2EF1, it is similar but I think that solution would be less ideal for this situation if you might agree. – CT Zhu Apr 22 '14 at 03:50

1 Answers1

2

Not sure how your multiple observations are organized in json. But it is clear that what is causing problem is you are having a nested structure for the "profilePicture" field. Therefore each observation is expressed as a nested dictionary. You need to convert each observation to a dataframe and concat them into the final dataframe as in this solution.

In [3]:
print df
                                             results
0  {u'linkedinAccount': u'', u'username': u'abc@g...
1  {u'linkedinAccount': u'', u'username': u'abc@g...

[2 rows x 1 columns]
In [4]: 
print pd.concat([pd.DataFrame.from_dict(item, orient='index').T for item in df.results])


  linkedinAccount       username registrationGate firstName title lastName  \
0                  abc@gmail.com           normal       abc    AA      xyz   
0                  abc@gmail.com           normal       abc    AA      xyz   

  company telephone                                     profilePicture  \
0     XYZ            {u'url': u'url.url.com', u'__type': u'File', u...   
0     ABC            {u'url': u'url.url.com', u'__type': u'File', u...   

  location                 updatedAt          email                 createdAt  \
0           2014-03-27T23:24:20.220Z  abc@gmail.com  2014-03-27T23:21:48.758Z   
0           2014-03-27T23:24:20.220Z  abc@gmail.com  2014-03-27T23:21:48.758Z   

  zipcode  
0   00000  
0   00000  

[2 rows x 14 columns]

Then you may want to think about how to deal the the profilePicture column. You can do what @U2EF1 suggested in the link. But I would probably just break that column into three columns pfPIC_url, pfPIC_type, pfPIC_name

CT Zhu
  • 52,648
  • 17
  • 120
  • 133