8

I have a dict like this:

data = {'1':{'a':10, 'b':30}, '2':{'a':20, 'b':60}}

and I would like to convert in into a dataframe like this:

x   y   z
1   a   10
1   b   30
2   a   20
2   b   60

Does anybody know how?

jpp
  • 159,742
  • 34
  • 281
  • 339
amrutha
  • 193
  • 1
  • 11

3 Answers3

12

Use dictionary comprehension with concat:

df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
df.columns = list('xyz')

print (df)
   x  y   z
0  1  a  10
1  1  b  30
2  2  a  20
3  2  b  60

For better performance use list compehension with sorting:

L = sorted([(k,k1,v1) for k,v in data.items() for k1,v1 in v.items()], 
            key=lambda x: (x[0], x[1]))
print (L)
[('1', 'a', 10), ('1', 'b', 30), ('2', 'a', 20), ('2', 'b', 60)]

df = pd.DataFrame(L, columns=list('xyz'))
print (df)
   x  y   z
0  1  a  10
1  1  b  30
2  2  a  20
3  2  b  60

Timings:

In [34]: %timeit jez1(data)
16.8 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [35]: %timeit jez(data)
1.96 s ± 90.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [37]: %timeit jp(data)
43 ms ± 353 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Same code as @jp:

data = {str(k): {'a': 10, 'b': 30} for k in range(10000)}

def jp(data):
    return pd.melt(pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'x'}),
                   id_vars=['x'], value_vars=['a', 'b'], var_name='y', value_name='z')\
             .sort_values(['x', 'y']).reset_index(drop=True)

def jez(data):
    df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
    df.columns = list('xyz')
    return df

def jez1(data):
    L = sorted([(k,k1,v1) for k,v in data.items() for k1,v1 in v.items()], key=lambda x: (x[0], x[1]))
    df = pd.DataFrame(L, columns=list('xyz'))
    return df

assert (jez1(data).values == jez(data).values).all()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Just saw your edit - nice one. I'd still like to *understand* the internals of why `concat` is slow. Maybe we need to start a `pandas-internal` tag!? – jpp Apr 15 '18 at 14:34
  • 1
    @jpp in my opinion reason is to many repeatings of concat of small `Series`. – jezrael Apr 15 '18 at 15:04
  • If suppose my dic is "data = {'1':{'a':10, 'b':30}, '2':{'a':20, 'b':60, 'c':30,'d' = 10}}" will the same code work?? – amrutha Apr 16 '18 at 04:48
  • my dic length is 2376. but when i convert it to df, the len of df is 2320. but in real case,it has to be >2376.No idea why it's missing some of the rows.. – amrutha Apr 16 '18 at 04:50
  • @amrutha - I test my solution with your data and working nice. How do you get length of dicts? – jezrael Apr 16 '18 at 05:13
  • I try test it by `print (sum([len(v) for k, v in data.items()]))` – jezrael Apr 16 '18 at 05:17
  • 1
    @amrutha - You can use `from collections import Counter` first, and then `data = {k:Counter(v) for k, v in data.items()}`, last apply first or second my solution. – jezrael Apr 16 '18 at 06:07
5

Here is one way using pandas.melt.

d = {'1':{'a':10, 'b':30}, '2':{'a':20, 'b':60}}

res = pd.melt(pd.DataFrame.from_dict(d, orient='index'),
              value_vars=['a', 'b'], var_name='y', value_name='z')

print(res)

#    y   z
# 0  a  10
# 1  a  20
# 2  b  30
# 3  b  60

Performance benchmarking

I expected pandas.melt to be inefficient, but applying pandas.concat on a large number of dictionaries can be even more expensive.

data = {str(k): {'a': 10, 'b': 30} for k in range(10000)}

def jp(data):
    return pd.melt(pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'x'}),
                   id_vars=['x'], value_vars=['a', 'b'], var_name='y', value_name='z')\
             .sort_values(['x', 'y']).reset_index(drop=True)

def jez(data):
    df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
    df.columns = list('xyz')
    return df

assert (jp(data).values == jez(data).values).all()

%timeit jp(data)   # 51.8 ms per loop
%timeit jez(data)  # 2.62 s per loop
jpp
  • 159,742
  • 34
  • 281
  • 339
2

Using Series

pd.Series(d).apply(pd.Series).stack().reset_index()
Out[359]: 
  level_0 level_1   0
0       1       a  10
1       1       b  30
2       2       a  20
3       2       b  60
BENY
  • 317,841
  • 20
  • 164
  • 234