
I want to concatenate two pandas DataFrames without copying the data. That is, I want the concatenated DataFrame to be a view onto the data in the two original DataFrames. I tried using concat() and that did not work. The following code shows that changing the underlying data affects the two original DataFrames but not the concatenated one:

import numpy as np
import pandas as pd

dates = pd.date_range('2013-01-01', periods=6)  # index seen in the output below
pp = print  # stand-in for the pretty-print helper used in the original session

arr = np.random.randn(12).reshape(6, 2)
df = pd.DataFrame(arr, columns=('VALE5', 'PETR4'), index=dates)
arr2 = np.random.randn(12).reshape(6, 2)
df2 = pd.DataFrame(arr, columns=('AMBV3', 'BBDC4'), index=dates)  # built from arr, not arr2, hence the duplicated values in the output
df_concat = pd.concat(dict(A=df, B=df2), axis=1)
pp(df)
pp(df_concat)
arr[0, 0] = 9999999.99
pp(df)
pp(df_concat)

This is the output of the last five lines. df changed after a new value was assigned to arr[0, 0]; df_concat was not affected.

In [56]: pp(df)
           VALE5     PETR4
2013-01-01 -0.557180  0.170073
2013-01-02 -0.975797  0.763136
2013-01-03 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442
2013-01-06 -0.323640  0.024857

In [57]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

In [58]: arr[0, 0] = 9999999.99

In [59]: pp(df)
                 VALE5     PETR4
2013-01-01  9999999.990000  0.170073
2013-01-02       -0.975797  0.763136
2013-01-03       -0.913254  1.042521
2013-01-04       -1.973013 -2.069460
2013-01-05       -1.259005  1.448442
2013-01-06       -0.323640  0.024857

In [60]: pp(df_concat)
               A                   B          
           VALE5     PETR4     AMBV3     BBDC4
2013-01-01 -0.557180  0.170073 -0.557180  0.170073
2013-01-02 -0.975797  0.763136 -0.975797  0.763136
2013-01-03 -0.913254  1.042521 -0.913254  1.042521
2013-01-04 -1.973013 -2.069460 -1.973013 -2.069460
2013-01-05 -1.259005  1.448442 -1.259005  1.448442
2013-01-06 -0.323640  0.024857 -0.323640  0.024857

I guess this means concat() created a copy of the data. Is there a way to avoid a copy being made? (I want to minimize memory usage).

Also, is there a fast way to check whether two DataFrames share the same underlying data (short of changing the data and checking whether each DataFrame changed)?

Thanks for the help.

FS

– Soldalma
  • Currently the best way to check if two numpy arrays are sharing memory is shown here: http://stackoverflow.com/questions/10747748/how-do-i-check-that-two-slices-of-numpy-arrays-are-the-same-or-overlapping – Phillip Cloud Aug 18 '13 at 04:21
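
For reference, here is a minimal sketch of such a check, along the lines of that comment. It is not part of the original post: the helper name is made up, and it assumes each frame's .values returns a view of a single underlying block (true for the homogeneous float frames above). np.may_share_memory is conservative, so a False result guarantees no sharing, while True only means the buffers might overlap.

import numpy as np
import pandas as pd

def frames_may_share_data(left, right):
    # Conservative check: False guarantees no shared memory; True only means
    # the underlying buffers might overlap.
    return np.may_share_memory(left.values, right.values)

# With the frames from the question, concat copied the data, so this should
# print False; comparing df with itself should print True.
# print(frames_may_share_data(df, df_concat))
# print(frames_may_share_data(df, df))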

1 Answer


You can't (at least not easily). When you call concat, np.concatenate ultimately gets called.

See this answer explaining why you can't concatenate arrays without copying. The short of it is that the arrays are not guaranteed to be contiguous in memory.

Here's a simple example:

import numpy as np

a = np.random.rand(2, 10)
x, y = a                      # x and y are views into a's two rows
z = np.vstack((x, y))         # vstack goes through np.concatenate, which copies
print('x.base is a and y.base is a ==', x.base is a and y.base is a)
print('x.base is z or y.base is z ==', x.base is z or y.base is z)

Output:

x.base is a and y.base is a == True
x.base is z or y.base is z == False

Even though x and y share the same base, namely a, concatenate (and thus vstack) cannot assume that they do, since one often wants to concatenate arbitrarily strided arrays.

You can easily generate two arrays with different strides that share the same memory, like so:

a = np.arange(10)   # default integer dtype: 8-byte items on a 64-bit build
b = a[::2]          # a strided view of a
print(a.strides)
print(b.strides)

Output:

(8,)
(16,)

This is why the following happens:

In [214]: a = arange(10)

In [215]: a[::2].view(int16)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-215-0366fadb1128> in <module>()
----> 1 a[::2].view(int16)

ValueError: new type not compatible with array.

In [216]: a[::2].copy().view(int16)
Out[216]: array([0, 0, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 6, 0, 0, 0, 8, 0, 0, 0], dtype=int16)

EDIT: Using pd.merge(df1, df2, copy=False) (or df1.merge(df2, copy=False)) will not make a copy when df1 and df2 have different dtypes. Otherwise, a copy is made.
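
For illustration, a minimal sketch of that behaviour; the frames and column names here are made up, and whether the copy is actually avoided depends on the pandas version, so treat the check at the end as something to run rather than a guaranteed result:

import numpy as np
import pandas as pd

idx = pd.date_range('2013-01-01', periods=6)
floats = pd.DataFrame(np.random.randn(6, 2), columns=['F1', 'F2'], index=idx)
ints = pd.DataFrame(np.random.randint(0, 10, size=(6, 2)),
                    columns=['I1', 'I2'], index=idx)  # different dtype than floats

merged = pd.merge(floats, ints, left_index=True, right_index=True, copy=False)

# If the float block was reused rather than copied, the merged column should
# still (conservatively) share memory with the original one.
print(np.may_share_memory(floats['F1'].values, merged['F1'].values))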

– Phillip Cloud

  • Is this behaviour a bug? `merge` has a `copy` parameter, so when I do `df_concat = df.merge(df1, left_index=True, right_index=True, copy=False)` and the indexes of both DataFrames are the same (which is true here, so the Series objects don't have to be changed), why is pandas copying the data when I explicitly instructed it not to with `copy=False`? Even if I first create an index and then create `df` and `df_concat` from it, so that `df.index is df_concat.index` returns `True`, the data is still copied. – Viktor Kerkez Aug 18 '13 at 13:18
  • I don't necessarily think it's a bug. I'm looking into it now. It should be better documented though. – Phillip Cloud Aug 18 '13 at 15:53
  • This is not a bug; concatenate always copies. There is no way to maintain a view (of course, you can simply set up subviews that correlate to the original frames if you want that). What exactly are you trying to accomplish? – Jeff Aug 18 '13 at 16:04
  • If that's the case, then I'm just asking what the `copy` parameter of the `merge` function does. :) – Viktor Kerkez Aug 18 '13 at 16:07
  • it tries to avoid a copy of the underlying block if the indexes are the same on the given axis – Phillip Cloud Aug 18 '13 at 16:12
  • But it doesn't do that. Even if the index is the same object. – Viktor Kerkez Aug 18 '13 at 16:24
  • Hm not sure then. Going to require some more delving. – Phillip Cloud Aug 18 '13 at 16:30
  • I'll edit my answer to reflect what we just talked about over at pandas. – Phillip Cloud Aug 18 '13 at 16:55