2

In python 3, pandas. Imagine there is a dataframe df with a column x

df=pd.DataFrame(
[
{'x':'{"a":"1","b":"2","c":"3"}'},
{'x':'{"a":"2","b":"3","c":"4"}'}
]
)

The column x has data which looks like a dictionary. Wonder how can I parse them into a new dataframe, so each key here becomes a new column?

The desired output dataframe is like

x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4

None of the solution in this post seems to work in this case

parsing a dictionary in a pandas dataframe cell into new row cells (new columns)

df1=pd.DataFrame(df.loc[:,'x'].values.tolist())

print(df1)

result the same dataframe. didn't separate the column into each key per column

Any 2 cents? Thanks!

Chubaka
  • 2,933
  • 7
  • 43
  • 58

2 Answers2

4

You can also map json.loads and convert to a dataframe like;

import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)

   a  b  c
0  1  2  3
1  2  3  4

this tests to be faster than evaluating via ast , below is the benchmark for 40K rows:

m = pd.concat([df]*20000,ignore_index=True)

%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
anky
  • 74,114
  • 11
  • 41
  • 70
  • Can you test `pd.DataFrame([json.loads(x) for x in df['x']], index=m.index)` ? – jezrael Mar 29 '20 at 09:38
  • 1
    @jezrael [map is a tad bit faster](https://stackoverflow.com/questions/11616599/python-list-comprehensions-vs-map) , : `312 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)` – anky Mar 29 '20 at 09:41
  • hmmm, I got `%timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index) 206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index) 210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – jezrael Mar 29 '20 at 09:41
  • a bit faster `list comprehension` – jezrael Mar 29 '20 at 09:41
  • @jezrael okay my tests gives anyways the idea of the answer is use json loads for performance :) now you have added it too in your answer .. Great..!! – anky Mar 29 '20 at 09:43
  • 1
    Thank you for your detailed answer. Really appreciate it. Learning a lot! @anky_91 – Chubaka Mar 29 '20 at 19:31
1

Because string repr of dictionaries is necessary convert values to dictionaries:

import ast, json

#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)

In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#anky_91 solution    
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

print(df1)
   a  b  c
0  1  2  3
1  2  3  4

Last for append to original:

df = df.join(df1)
print(df)
                           x  a  b  c
0  {"a":"1","b":"2","c":"3"}  1  2  3
1  {"a":"2","b":"3","c":"4"}  2  3  4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252