10

I have a dict of lists (which have variable lengths), I am looking forward to an efficient way of creating a Dataframe from it.

Assume I have minimum list length, so I can truncate size of bigger lists while creating Dataframe.

Here is my dummy code

data_dict = {'a': [1,2,3,4], 'b': [1,2,3], 'c': [2,45,67,93,82,92]}
min_length = 3

I can have a dictionary of 10k or 20k keys, so looking for an efficient way to create a DataFrame like bellow

>>> df
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
John
  • 1,212
  • 1
  • 16
  • 30
  • Related question.[Creating dataframe from a dictionary where entries have different lengths](https://stackoverflow.com/questions/19736080/creating-dataframe-from-a-dictionary-where-entries-have-different-lengths) – Trenton McKinney Sep 10 '20 at 00:33

2 Answers2

5

A one-liner solution:

#Construct the df horizontally and then transpose. Finally drop rows with nan.
pd.DataFrame.from_dict(data_dict,orient='index').T.dropna()
Out[326]: 
     a    b     c
0  1.0  1.0   2.0
1  2.0  2.0  45.0
2  3.0  3.0  67.0
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • Just adding for beginners, if you don't want to drop values from longer lists, avoid the `dropna()` part. Nice answer. – deadcode Aug 18 '20 at 15:14
3

You can filter values of dict in dict comprehension, then DataFrame works perfectly:

print ({k:v[:min_length] for k,v in data_dict.items()})
{'b': [1, 2, 3], 'c': [2, 45, 67], 'a': [1, 2, 3]}


df = pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()})
print (df)
   a  b   c
0  1  1   2
1  2  2  45
2  3  3  67

If is possible some length can be less as min_length add Series:

data_dict = {'a': [1,2,3,4], 'b': [1,2], 'c': [2,45,67,93,82,92]}
min_length = 3

df = pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()})
print (df)
   a    b   c
0  1  1.0   2
1  2  2.0  45
2  3  NaN  67

Timings:

In [355]: %timeit (pd.DataFrame({k:v[:min_length] for k,v in data_dict.items()}))
The slowest run took 5.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop

In [356]: %timeit (pd.DataFrame({k:pd.Series(v[:min_length]) for k,v in data_dict.items()}))
The slowest run took 4.50 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 937 µs per loop

#Allen's solution
In [357]: %timeit (pd.DataFrame.from_dict(data_dict,orient='index').T.dropna())
1 loop, best of 3: 16.7 s per loop

Code for timings:

np.random.seed(123)
L = list('ABCDEFGH')
N = 500000
min_length = 10000

data_dict = {k:np.random.randint(10, size=np.random.randint(N)) for k in L}
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252