I have a NumPy structured array with integer and float fields, and I use it to initialize a pandas DataFrame:

In [497]: x = np.ones(100000000, dtype=[('f0', '<i8'), ('f1', '<f8'),('f2','<i8'),('f3', '<f8'),('f4', '<f8'),('f5', '<f8'),('f6', '<f8'),('f7', '<f8')])

In [498]: %timeit pd.DataFrame(x)
The slowest run took 4.07 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 2min 26s per loop


In [499]: xx=x.view(np.float64).reshape(x.shape + (-1,))

In [500]: %timeit pd.DataFrame(xx)
1 loops, best of 3: 256 ms per loop

As the timings above show, initializing the DataFrame from a structured array is very slow (2min 26s), whereas initializing it from a plain float array view of the same data is fast (256 ms). However, I still need the DataFrame to hold a mixture of floats and integers.

After some more tests, I realized that the DataFrame constructor copies the whole structured array (this does not happen when I initialize from the float view of the structured array). I found more information here: https://github.com/pydata/pandas/issues/9216
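The copy can be checked on a much smaller array. The snippet below is only a sketch: it assumes `np.shares_memory` is an acceptable test for "no copy was made", it uses 1,000 rows instead of 100,000,000, and the variable names `df_struct`, `df_float`, `struct_shared` and `float_shared` are made up for illustration. The result may differ between pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Same field layout as above, but only 1,000 rows so the check runs instantly.
dtype = [('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8'), ('f3', '<f8'),
         ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8')]
x = np.ones(1000, dtype=dtype)
xx = x.view(np.float64).reshape(x.shape + (-1,))

df_struct = pd.DataFrame(x)   # built from the structured array
df_float = pd.DataFrame(xx)   # built from the 2-D float view

# True means the column still points into the original buffer (no copy made).
struct_shared = {name: np.shares_memory(x[name], df_struct[name].to_numpy())
                 for name in x.dtype.names}
float_shared = np.shares_memory(xx, df_float.to_numpy())

print("structured array, per column:", struct_shared)
print("float view:", float_shared)
```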

Is there any way to speed up the initialization and avoid the copy? I am open to alternative methods, but the data comes from a structured array.
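One candidate alternative is to pass the fields one at a time as a dict of per-field views, which keeps each column's dtype. This is an untimed sketch on a small array, not a confirmed fix: `copy=False` only requests that pandas avoid copying, and pandas may still copy depending on the version. The name `columns` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Reduced layout and size, for illustration only.
dtype = [('f0', '<i8'), ('f1', '<f8'), ('f2', '<i8'), ('f3', '<f8')]
x = np.ones(1000, dtype=dtype)

# x[name] is a strided view onto one field; no data is copied at this step.
columns = {name: x[name] for name in x.dtype.names}

# copy=False is a request, not a guarantee; pandas may still copy internally.
df = pd.DataFrame(columns, copy=False)
print(df.dtypes)
```

Timing this with `%timeit` against `pd.DataFrame(x)` on the full-size array would show whether it helps.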

snowleopard