
Reposted from https://groups.google.com/forum/#!topic/pydata/5mhuatNAl5g

It seems that when creating a DataFrame from a structured array, the data is copied. I get similar results if the data is instead a dictionary of NumPy arrays.

Is there any way to create a DataFrame from a structured array (or similar) without any copying or checking?

In [44]: sarray = randn(1e7,10).view([(name, float) for name in 'abcdefghij']).squeeze()

In [45]: for N in [10,100,1000,10000,100000,1000000,10000000]:
    ...:     s = sarray[:N]
    ...:     %timeit z = pd.DataFrame(s)
    ...: 
1000 loops, best of 3: 830 µs per loop
1000 loops, best of 3: 834 µs per loop
1000 loops, best of 3: 872 µs per loop
1000 loops, best of 3: 1.33 ms per loop
100 loops, best of 3: 15.4 ms per loop
10 loops, best of 3: 161 ms per loop
1 loops, best of 3: 1.45 s per loop 

Thanks, Dave

Dave Hirschfeld

2 Answers

Pandas' DataFrame uses the BlockManager to consolidate similarly typed data into a single memory chunk. It is this consolidation into a single chunk that causes the copy. If you initialize as follows:

pd.DataFrame(npmatrix, copy=False)

then the DataFrame will not copy the data, but will reference it instead.
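
As a quick sanity check, something like the following should show the sharing (a sketch; np.may_share_memory is standard NumPy, but newer pandas versions with copy-on-write enabled may copy on the first write):

import numpy as np
import pandas as pd

npmatrix = np.random.randn(1000, 10)              # plain, single-dtyped ndarray
df = pd.DataFrame(npmatrix, copy=False)

# Mutating through the DataFrame is visible in the source array,
# so the DataFrame references the buffer rather than copying it.
df.iloc[0, 0] = 42.0
print(npmatrix[0, 0])                             # 42.0
print(np.may_share_memory(df.values, npmatrix))   # True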

HOWEVER, sometimes you start with multiple arrays, and the BlockManager will try to consolidate the data into a single chunk. In that situation, I think your only option is to monkey-patch the BlockManager so that it does not consolidate the data.
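
For illustration only, a minimal sketch of such a patch, assuming the internal _consolidate_inplace hook present in pandas of this era (this is private API, so the name and location may differ in your version):

from pandas.core.internals import BlockManager

# Sketch: make in-place consolidation a no-op so blocks keep referencing
# the arrays they were built from. _consolidate_inplace is internal pandas
# API and may be renamed or removed in other versions.
BlockManager._consolidate_inplace = lambda self: None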

I agree with @DaveHirschfeld that this could be exposed as a consolidate=False parameter to BlockManager. Pandas would be better for it.

user48956

This will, by definition, coerce the dtypes to a single dtype (e.g. float64 in this case); there is no way around that. The result is a view on the original array. Note that this only helps with construction: most operations will tend to make and return copies.

In [44]: s = sarray[:1000000]

Original Method

In [45]: %timeit DataFrame(s)
10 loops, best of 3: 107 ms per loop

Coerce to an ndarray and pass in copy=False (this has no effect for a structured array, ONLY for a plain, single-dtyped ndarray):

In [47]: %timeit DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)
100 loops, best of 3: 3.3 ms per loop

In [48]: result = DataFrame(s.view(np.float64).reshape(-1,len(s.dtype.names)),columns=s.dtype.names,copy=False)

In [49]: result2 = DataFrame(s)

In [50]: result.equals(result2)
Out[50]: True

Note that both DataFrame.from_dict and DataFrame.from_records will copy this. Pandas keeps like-dtyped ndarrays in a single ndarray, and aggregating them requires an expensive np.concatenate, which is what happens under the hood. Using a view avoids this issue.
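
To confirm that the view construction really shares memory with the source, something like this should hold (a sketch; np.shares_memory requires NumPy >= 1.11, and copy-on-write in newer pandas may change the result):

import numpy as np
import pandas as pd

sarray = np.random.randn(1000, 3).view([(n, float) for n in 'abc']).squeeze()
flat = sarray.view(np.float64).reshape(-1, len(sarray.dtype.names))
df = pd.DataFrame(flat, columns=sarray.dtype.names, copy=False)

# The DataFrame aliases the structured array's buffer: no copy was made.
print(np.shares_memory(df.values, sarray))   # True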

I suppose this could be the default for a structured array if the passed dtypes are all the same. But then you have to ask why you are using a structured array in the first place (obviously to get name access, but is there another reason?).

Jeff
  • I'm using IOPro from Continuum to do large database queries. It can dramatically reduce the memory used and speed up the query by directly constructing a structured array without going through Python types. It's unfortunate, then, that creating a DataFrame from the array copies it, losing some of the performance benefits. – Dave Hirschfeld Jun 19 '14 at 14:29
  • Unfortunately the returned data from the query isn't a single homogeneous type, so the view trick won't work. Is there any reason why, if you pass a list or dictionary of heterogeneous numpy arrays, it has to copy them? Could it not just take a reference to each? – Dave Hirschfeld Jun 19 '14 at 14:40
  • Well, as I explained above, dtypes are consolidated into single ndarrays, so they must be concatenated (which is where all of the time is spent). You can simply keep them as a dict of Series to avoid copying if you want, but then you must manually align them and operations become tricky. Better to do this once, then write them as HDF5 files; they come back already blocked by dtype on reading (see the sketch after these comments). – Jeff Jun 19 '14 at 14:45
  • Would it be feasible to have a `consolidate=False` so that it just took references to the underlying arrays instead of allocating memory and copying the data? The cost of copying could then be deferred until an operation actually requires a copy. – Dave Hirschfeld Jun 19 '14 at 15:33
  • Try blaze. That would be a major, non-trivial change. It could be done (anything can be done); it certainly IS possible, as for example ``sparse`` types ARE non-consolidatable, as are ``Categoricals``. But it would require a major effort by someone. – Jeff Jun 19 '14 at 15:36
  • Fair enough. Thanks for your help! – Dave Hirschfeld Jun 19 '14 at 15:51
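
Following up on the HDF5 suggestion in the comments, a minimal sketch of the one-time round trip (assumes PyTables is installed; 'data.h5' and the key 'df' are placeholder names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': np.random.randn(1000)})

# Pay the consolidation cost once on write; on read the frame comes back
# already blocked by dtype.
df.to_hdf('data.h5', 'df', mode='w')
df2 = pd.read_hdf('data.h5', 'df')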