21

I have an interesting problem, and I'm wondering if there's a concise, pythonic (pandastic?) way to do this, rather than iterating over rows of a data frame.

Take a DataFrame with one field that is a json encoding of information:

    Name      Data
0   Joe       '[{"label":"a","value":"1"},{"label":"b","value":"2"}]'
1   Sue       '[{"label":"a","value":"3"},{"label":"c","value":"4"}]'
2   Bob       '[{"label":"b","value":"4"},{"label":"d","value":"1"}]'

I want to expand the json field to be data fields, unioning the different column headers, to get this:

    Name      Data                 a    b    c    d
0   Joe       '[{"label":"a"...    1    2    
1   Sue       '[{"label":"a"...    3         4
2   Bob       '[{"label":"b"...         4         1

The blanks are missing values. I know I can use read_json to create data frames from the json field, but then I want to re-flatten these data frames into extra columns of the original data set.

So, is there an elegant way to do this without iterating over the various rows of the data frame? Any help would be appreciated.

David Pepper
  • 593
  • 1
  • 4
  • 14
  • 1
    The string `'{{a:1},{b:2}}'` is not valid JSON. Is that really what you have or can the actual value be decoded with `json.loads`? – unutbu Aug 26 '14 at 17:36
  • 1
    Thanks, I was being quick and sloppy. I've redone the question to have the correct json corresponding to the problem. – David Pepper Aug 26 '14 at 18:19

1 Answers1

28

Given

In [96]: df
Out[96]: 
  Name                   Data
0  Joe  [{"a":"1"},{"b":"2"}]
1  Sue  [{"a":"3"},{"c":"4"}]
2  Bob  [{"b":"4"},{"d":"1"}]

if you define

import json
def json_to_series(text):
    keys, values = zip(*[item for dct in json.loads(text) for item in dct.items()])
    return pd.Series(values, index=keys)

then

In [97]: result = pd.concat([df, df['Data'].apply(json_to_series)], axis=1)

In [98]: result
Out[98]: 
  Name                   Data    a    b    c    d
0  Joe  [{"a":"1"},{"b":"2"}]    1    2  NaN  NaN
1  Sue  [{"a":"3"},{"c":"4"}]    3  NaN    4  NaN
2  Bob  [{"b":"4"},{"d":"1"}]  NaN    4  NaN    1

Given

In [22]: df
Out[22]: 
  Name                                               Data
0  Joe  [{"label":"a","value":"1"},{"label":"b","value...
1  Sue  [{"label":"a","value":"3"},{"label":"c","value...
2  Bob  [{"label":"b","value":"4"},{"label":"d","value...

if you define

def json_to_series(text):
    keys, values = zip(*[(dct['label'], dct['value']) for dct in json.loads(text)])
    return pd.Series(values, index=keys)

then

In [20]: result = pd.concat([df, df['Data'].apply(json_to_series)], axis=1)

In [21]: result
Out[21]: 
  Name                                               Data    a    b    c    d
0  Joe  [{"label":"a","value":"1"},{"label":"b","value...    1    2  NaN  NaN
1  Sue  [{"label":"a","value":"3"},{"label":"c","value...    3  NaN    4  NaN
2  Bob  [{"label":"b","value":"4"},{"label":"d","value...  NaN    4  NaN    1

References:

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 1
    Thanks very much unutbu -- this is great. Is it easy to tweak the code so that it works on the input I gave in the revised question, with the column headers included in the json? – David Pepper Aug 27 '14 at 15:12
  • 1
    Sure. A little change is all you need. The tools used here take some time to explain. I think the docs do a better job than I could. So I've left some links to references; if you have any questions feel free to ask. – unutbu Aug 27 '14 at 16:12