How to efficiently extract fields from a JSON column?

Question

Consider the following example:

import pandas as pd

data1 = [{'type': 'one', 'delta': '1', 'time': '2019'}, {'type': 'two', 'delta': '1', 'time': '2018'}]
data2 = [{'type': 'one', 'delta': '1', 'time': '2013'}, {'type': 'two', 'delta': '1', 'time': '2012'}]


dftest = pd.DataFrame({'weirdjson' : [data1, data2]})
dftest['normalcol'] = 1

dftest

Out[79]: 
                                                                                        weirdjson  normalcol  time_type_one  time_type_two
0  [{'type': 'one', 'delta': '1', 'time': '2019'}, {'type': 'two', 'delta': '1', 'time': '2018'}]          1           2019           2018
1  [{'type': 'one', 'delta': '1', 'time': '2013'}, {'type': 'two', 'delta': '1', 'time': '2012'}]          1           2013           2012

Essentially, I would like to create two columns time_type_one and time_type_two that each contain their corresponding time value (for the first row: 2019 for type one and 2018 for type two).

How can I do that in Pandas? I have many rows so I am looking for something very efficient. Thanks!

I did not downvote, but I think people downvoted because no attempt is shown. Also, if we copy this dataframe, the column is copied as strings instead of lists of dictionaries. It would be better to include your sample dataframe as copyable code with `pd.DataFrame(..)` – Erfan Dec 29 '19 at 20:27
While you're at it, it's also best to include the expected output as a dataframe, so people can see what you are trying to do. – Erfan Dec 29 '19 at 20:28
I think you can take a look at this: https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json – E. Zeytinci Dec 29 '19 at 21:32

score 1 · Answer 1 · answered Dec 29 '19 at 21:43

Try this:

import json
import pandas as pd

data = [{'normalcol':1, 'weirdjsoncol':'[{"type": "one", "delta": "1", "time": "2019"}, {"type": "two", "delta": "1", "time": "2018"}]'}, {'normalcol':2, 'weirdjsoncol':'[{"type": "two", "delta": "1", "time": "2017"}, {"type": "one", "delta": "1", "time": "2013"}]'}]

df = pd.DataFrame(data)

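# Parse each JSON string once instead of once per extracted column.
parsed = df['weirdjsoncol'].apply(json.loads)

# next() returns the first matching "time"; None is the fallback when a type is missing.
df['time_type_one'] = parsed.apply(lambda recs: next((r["time"] for r in recs if r["type"] == "one"), None))

df['time_type_two'] = parsed.apply(lambda recs: next((r["time"] for r in recs if r["type"] == "two"), None))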

Zeeshan


nice, but what is the `next` for? – ℕʘʘḆḽḘ Dec 29 '19 at 21:53
It's a Python built-in function that retrieves the next item from an iterator. You can read about it in the Python documentation: https://docs.python.org/3/library/functions.html#next – Zeeshan Dec 29 '19 at 21:59
thanks. this looks good. I wonder whether iterating is fast or slow though. any thoughts? – ℕʘʘḆḽḘ Dec 29 '19 at 22:18
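
To make the role of `next` concrete, here is a minimal sketch on a single list of records (the variable names are made up for illustration). The generator is lazy, so `next` stops at the first matching record instead of scanning the whole list, and the second argument is the value returned when nothing matches. Note that `apply` still calls this Python code once per row, so it is not vectorized.

```python
records = [
    {"type": "one", "delta": "1", "time": "2019"},
    {"type": "two", "delta": "1", "time": "2018"},
]

# The generator yields matching "time" values lazily; next() takes the first one.
first_one = next((r["time"] for r in records if r["type"] == "one"), None)

# No record has type "three", so the default (None) is returned
# instead of raising StopIteration.
missing = next((r["time"] for r in records if r["type"] == "three"), None)

print(first_one, missing)
```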

score 1 · Answer 2 · answered Dec 29 '19 at 21:56

You can try this:

# Flatten every record of every row into one dataframe (DataFrame.append was removed in pandas 2.0).
df_new = pd.DataFrame([rec for recs in dftest.weirdjson for rec in recs])
df_new = df_new.pivot(columns='type', values=['delta', 'time']).apply(lambda x: pd.Series(x.dropna().values))
df_new.columns = ['_'.join(col) for col in df_new.columns.values]

  delta_one delta_two time_one time_two
0         1         1     2019     2018
1         1         1     2013     2012

edited Dec 29 '19 at 22:02

answered Dec 29 '19 at 21:56

oppressionslayer


I updated the second line, so it should work now with your dftest; I forgot the df_new = – oppressionslayer Dec 29 '19 at 22:03
I'm not sure how fast pivot is. I can do it another way without pivot, where I aggregate the values as lists and expand them out; it takes a few more steps, but with shorter lines – oppressionslayer Dec 29 '19 at 22:05
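
The pivot-free route was not posted, so the following is only one possible reading of "aggregate the values as lists and expand them out", not the commenter's code. It is a rough sketch that assumes every row contains each type exactly once; the names `flat`, `lists` and `wide` are made up here.

```python
import pandas as pd

data1 = [{'type': 'one', 'delta': '1', 'time': '2019'}, {'type': 'two', 'delta': '1', 'time': '2018'}]
data2 = [{'type': 'one', 'delta': '1', 'time': '2013'}, {'type': 'two', 'delta': '1', 'time': '2012'}]
dftest = pd.DataFrame({'weirdjson': [data1, data2]})

# One dataframe row per record, in the original row order.
flat = pd.DataFrame([rec for recs in dftest['weirdjson'] for rec in recs])

# Aggregate the time values of each type into a list ...
lists = flat.groupby('type')['time'].agg(list)

# ... and expand the lists back out, one column per type.
# Assumption: each type occurs exactly once per original row,
# otherwise the lists differ in length or fall out of alignment.
wide = pd.DataFrame(lists.to_dict()).add_prefix('time_')
print(wide)
```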

score 1 · Accepted Answer · answered Dec 30 '19 at 01:39

You can use explode, build a new dataframe from the exploded records, and unstack type into columns as follows:

s = dftest.weirdjson.explode()
df_new = (pd.DataFrame({'type': s.str['type'], 'time': s.str['time']}) 
            .set_index('type', append=True).time.unstack().add_prefix('time_type_'))

Out[461]:
type time_type_one time_type_two
0             2019          2018
1             2013          2012
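
To end up with these columns next to normalcol, the reshaped frame can be joined back onto dftest by index. The sketch below swaps in a plain comprehension for the explode/unstack step (a different technique from the code above, shown only for comparison) and then joins; `times` and `result` are names made up for this example. If a type occurs more than once in a row, the last occurrence wins.

```python
import pandas as pd

data1 = [{'type': 'one', 'delta': '1', 'time': '2019'}, {'type': 'two', 'delta': '1', 'time': '2018'}]
data2 = [{'type': 'one', 'delta': '1', 'time': '2013'}, {'type': 'two', 'delta': '1', 'time': '2012'}]
dftest = pd.DataFrame({'weirdjson': [data1, data2]})
dftest['normalcol'] = 1

# Build one dict per row that maps the new column name to its time value.
times = pd.DataFrame(
    [{f"time_type_{rec['type']}": rec['time'] for rec in recs}
     for recs in dftest['weirdjson']],
    index=dftest.index,  # keep the original index so the join lines up
)

# Attach the new columns to the original dataframe.
result = dftest.join(times)
print(result[['normalcol', 'time_type_one', 'time_type_two']])
```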
