
I have a pandas Series of dictionaries, and I want to convert it to a DataFrame with the same index.

The only way I found is to go through the `to_dict` method of the Series, which is inefficient because it drops back to pure Python instead of staying in numpy/pandas/Cython.

Do you have suggestions for a better approach?

Thanks a lot.

>>> import pandas as pd
>>> flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
>>> flagInfoSeries
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object
>>> pd.DataFrame(flagInfoSeries.to_dict()).T
    a   b
0   1   2
1  10  20
Michael Hooreman

3 Answers


I think you can use a list comprehension:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
print(flagInfoSeries)
0      {'a': 1, 'b': 2}
1    {'a': 10, 'b': 20}
dtype: object

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
0   1   2
1  10  20

print(pd.DataFrame([x for x in flagInfoSeries]))
    a   b
0   1   2
1  10  20

Timing:

In [203]: %timeit pd.DataFrame(flagInfoSeries.to_dict()).T
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 554 µs per loop

In [204]: %timeit pd.DataFrame([x for x in flagInfoSeries])
The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 361 µs per loop

In [209]: %timeit flagInfoSeries.apply(lambda d: pd.Series(d))
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 751 µs per loop

EDIT:

If you need to keep the index, add index=flagInfoSeries.index to the DataFrame constructor:

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))

Timings:

In [257]: %timeit pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index)
1000 loops, best of 3: 350 µs per loop

Sample:

import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}))
flagInfoSeries.index = [2,8]
print(flagInfoSeries)
2      {'a': 1, 'b': 2}
8    {'a': 10, 'b': 20}
dtype: object

print(pd.DataFrame(flagInfoSeries.to_dict()).T)
    a   b
2   1   2
8  10  20

print(pd.DataFrame([x for x in flagInfoSeries], index=flagInfoSeries.index))
    a   b
2   1   2
8  10  20
jezrael
  • Yep, so your computer is faster, but your code still wins :) – IanS Feb 24 '16 at 11:16
  • Yes, you are right. I wanted to add a comparison on my PC. :) – jezrael Feb 24 '16 at 11:17
  • Thanks for those suggestions. Indeed, there are performance improvements ... but the indexes are not kept: the list comprehension gives a list `[{mydict}, ...]` without the index, while `to_dict` gives a dictionary of `{index: {mydict}, ...}`. I think I'll keep it like this for now. – Michael Hooreman Feb 24 '16 at 13:02
  • The solution was modified, please check it. – jezrael Feb 24 '16 at 13:12
  • It's even faster with the index! – IanS Feb 24 '16 at 13:19
  • Yes, at first I was surprised too, but I think the `DataFrame` constructor takes the index directly instead of computing one, so it is faster. – jezrael Feb 24 '16 at 13:24
  • Great! It works perfectly and it is fast. Thanks a lot! – Michael Hooreman Feb 24 '16 at 14:31
  • Great comparison! Anyway, I think using `pd.DataFrame(flagInfoSeries.to_numpy().tolist())` is even faster than your proposal (which would be even more visible when running for longer series, not just two items). – Nerxis Jan 28 '22 at 13:39
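The `to_numpy().tolist()` approach mentioned in the last comment can be sketched like this (a sketch assuming pandas ≥ 0.24, where `Series.to_numpy` was introduced); as with the list comprehension, the original index has to be passed to the constructor explicitly:

```python
import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}), index=[2, 8])

# Convert the object-dtype values into a plain list of dicts, then let the
# DataFrame constructor expand the dict keys into columns; pass the original
# index so it is preserved
df = pd.DataFrame(flagInfoSeries.to_numpy().tolist(), index=flagInfoSeries.index)
print(df)
```

This avoids the per-row dict lookups of `to_dict`, which is why it should scale better on longer Series.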

You can use pd.json_normalize(flagInfoSeries).
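A minimal sketch, assuming a recent pandas (`json_normalize` moved to the top-level namespace in 1.0). Note that `json_normalize` builds a fresh `RangeIndex`, so the original index has to be restored by hand:

```python
import pandas as pd

flagInfoSeries = pd.Series(({'a': 1, 'b': 2}, {'a': 10, 'b': 20}), index=[2, 8])

# json_normalize expands each dict into columns but discards the Series index,
# so reassign it afterwards
df = pd.json_normalize(flagInfoSeries)
df.index = flagInfoSeries.index
print(df)
```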

Carlos Horn

This avoids to_dict, but apply could be slow too:

flagInfoSeries.apply(lambda d: pd.Series(d))

Edit: I see that jezrael has added timing comparisons. Here is mine:

%timeit flagInfoSeries.apply(lambda d: pd.Series(d))
1000 loops, best of 3: 935 µs per loop
IanS