
An issue appeared when concatenating two pandas DataFrames and then updating the index. After boiling the issue down, the concatenation can be ignored. Despite creating a copy of the new DataFrame or its index, changing the elements of the copy still changes the original DataFrame's index. Below is a basic example you can run to reproduce the issue.

A few alternatives that have been tried:

  1. ind = df.copy().index.to_numpy(): changing ind alters df
  2. ind = df.index.copy().to_numpy(): changing ind alters df
  3. ind = df.copy(deep=True).index.to_numpy(): changing ind alters df
  4. ind = df.index.copy(deep=True).to_numpy(): changing ind does not alter df.

Why don't options 1-3 behave like option 4?

import pandas as pd

# Define a DataFrame with an explicit index
df = pd.DataFrame(index=[0,1.,2.], data={'y':[0,0,0]})
print('Original DataFrame')
print(df)

# Update index
ind = df.copy().index.to_numpy() # Option 1
#ind = df.index.copy().to_numpy() # Option 2
#ind = df.copy(deep=True).index.to_numpy() # Option 3
#ind = df.index.copy(deep=True).to_numpy() # Option 4
ind[:] += 3

# Why does the index of (df) get updated?
print("\n\nAfter updating copy of index:")
print(df)

Output (Pandas v1.0.1, Python v3.7.4):

Original DataFrame
     y
0.0  0
1.0  0
2.0  0


After updating copy of index:
     y
3.0  0
4.0  0
5.0  0
Hamid
  • I have tried to explain how they are different with a brief explanation, hope that answers the question ! – Umar Aftab May 15 '20 at 20:41
  • For reference https://stackoverflow.com/questions/35910577/why-does-python-numpys-mutate-the-original-array. Perhaps use `ind = ind + 3` if for some reason you need to ensure changes don't propagate back to the df. – ALollz May 16 '20 at 03:04

2 Answers

2

The simple answer is, the culprit is to_numpy() (Emphasis mine):

copy: bool, default False
Whether to ensure that the returned value is a not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensure that a copy is made, even if not strictly necessary.

>>> ind = df.copy().index.to_numpy(copy=True)
>>> ind
array([0., 1., 2.])
>>> df
     y
0.0  0
1.0  0
2.0  0
>>> ind += 3
>>> df
     y
0.0  0
1.0  0
2.0  0
>>> ind
array([3., 4., 5.])

Since to_numpy uses np.asarray, it's worthwhile to make note of this bit as well (Emphasis mine):

out : ndarray
Array interpretation of a. No copy is performed if the input is already an ndarray with matching dtype and order. If a is a subclass of ndarray, a base class ndarray is returned.
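
A minimal NumPy-only sketch of that behaviour may help here. The array names are illustrative and not from the question: `np.asarray` hands back the very same object when it is given an ndarray, so an in-place change is visible through both names, whereas `np.array` copies by default.

```python
import numpy as np

original = np.array([0.0, 1.0, 2.0])

same = np.asarray(original)   # already an ndarray: no copy is made
copied = np.array(original)   # np.array copies by default

print(same is original)                    # True
print(np.shares_memory(copied, original))  # False

same += 3          # in-place change, visible through `original` too
print(original)    # [3. 4. 5.]
print(copied)      # [0. 1. 2.]
```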


The deeper answer is: the underlying data buffer of the index is carried over from copy to copy, unless a true copy is explicitly made of the index itself rather than of the df. Observe this test:

tests = '''df.index
df.copy().index
df.index.copy()
df.copy(deep=True).index
df.index.copy(deep=True)'''

print('Underlying object reference test...')
for test in tests.split('\n'):

    # !!! Do as I say not as I do  !!!
    # !!! eval will ruin your life !!!

    print(f'{"{:54}".format(f"With {test} is:")}{eval(test).values.__array_interface__["data"]}')
    print(f'{"{:54}".format(f"With {test}.to_numpy() is:")}{eval(test).to_numpy().__array_interface__["data"]}')
    print(f'{"{:54}".format(f"With {test}.to_numpy(copy=True) is:")}{eval(test).to_numpy(copy=True).__array_interface__["data"]}')

Results:

Underlying object reference test...
With df.index is:                                     (61075440, False) # <-- reference to watch for
With df.index.to_numpy() is:                          (61075440, False) # same as df.index
With df.index.to_numpy(copy=True) is:                 (61075504, False) # True copy
With df.copy().index is:                              (61075440, False) # same as df.index
With df.copy().index.to_numpy() is:                   (61075440, False) # same as df.index
With df.copy().index.to_numpy(copy=True) is:          (61075504, False) # True copy
With df.index.copy() is:                              (61075440, False) # same as df.index
With df.index.copy().to_numpy() is:                   (61075440, False) # same as df.index
With df.index.copy().to_numpy(copy=True) is:          (61075504, False) # True copy
With df.copy(deep=True).index is:                     (61075440, False) # same as df.index
With df.copy(deep=True).index.to_numpy() is:          (61075440, False) # same as df.index
With df.copy(deep=True).index.to_numpy(copy=True) is: (61075504, False) # True copy
With df.index.copy(deep=True) is:                     (61075504, False) # True copy
With df.index.copy(deep=True).to_numpy() is:          (61075504, False) # True copy
With df.index.copy(deep=True).to_numpy(copy=True) is: (61075472, False) # True copy of True copy

As you can see, unless an explicit true copy is made on the index directly, or via `to_numpy(copy=True)`, the resulting array shares memory with the original index, so modifying it in place will inadvertently change your existing data.
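
As a rough alternative to comparing raw addresses, the same question can be asked with `np.shares_memory`, which avoids `eval`. This is only a sketch: the `candidates` dict and its labels are made up for illustration, and the True/False pattern for the options without `copy=True` may differ between pandas versions, so no output is shown for them. Only the `to_numpy(copy=True)` entry is certain not to share memory.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[0, 1.0, 2.0], data={"y": [0, 0, 0]})
base = df.index.values  # the array backing the original index

# Hypothetical labels mapped to the expressions from the question
candidates = {
    "df.copy().index.to_numpy()": lambda: df.copy().index.to_numpy(),
    "df.index.copy().to_numpy()": lambda: df.index.copy().to_numpy(),
    "df.copy(deep=True).index.to_numpy()": lambda: df.copy(deep=True).index.to_numpy(),
    "df.index.copy(deep=True).to_numpy()": lambda: df.index.copy(deep=True).to_numpy(),
    "df.index.to_numpy(copy=True)": lambda: df.index.to_numpy(copy=True),
}

# True means an in-place change to the array could reach df's index
shares = {label: np.shares_memory(base, make()) for label, make in candidates.items()}

for label, shared in shares.items():
    print(f"{label:38} shares memory with df.index: {shared}")
```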

As to why the true copies have the same reference (except the true copy of the true copy), I don't have a full appreciation of what's happening under the hood. But I'm guessing it has to do with some optimization magic to save memory. That, however, is probably for another question.

r.ook
  • OP in case you already saw this, I updated the answer with a more exhaustive test for some sanity. Hopefully it helps. – r.ook May 16 '20 at 02:52
  • I think you could argue that the issue is really with the `+=` operator. Using `ind = ind + 3` gives a changed `ind` and an unchanged `df` regardless of which of the 4 methods you choose. – ALollz May 16 '20 at 03:00
  • @ALollz Well yes, but no... because if the object reference wasn't carried over, modifying the object inplace wouldn't have mattered either way. There could have been a use case for the `+=` that we don't know of. I do see your point, but I think object reference is something that's kinda a "gotcha" for the unaware so I usually prefer to keep that mindful. – r.ook May 16 '20 at 03:03
  • @r.ook This is helpful. Would you agree that the culprit is not `to_numpy()`, however? If a proper copy (i.e of both data and index or just index) was made in the previous chain of commands then `to_numpy()` won't need any explicit keyword arguments. But what I'm reading from your answer is that adding the `copy=True` keyword argument to `to_numpy()` ensures a proper copy regardless of the earlier chain of commands. – Hamid May 17 '20 at 17:35
  • As you mentioned @r.ook, perhaps this is for another question, but the level of details required to make a full copy of a dataframe seems unfriendly and maybe even considered a bug. Maybe there is a deeper reason but from a user perspective the behavior is surprising. – Hamid May 17 '20 at 17:38
  • Yeah, the test that `df.copy(deep=True).index` is still the same is a little surprising for me. I actually dug further, and believe there are several components that lead to this. The root is that you explicitly assigned an object to the `index=...` kwarg in the df constructor. When not explicitly defined this issue is not observed. The second is that you change the returned object of `to_numpy()`. If you had done `df.copy().index +=1`, this would not be observed at all. `pd.Index` objects probably have something built in to copy these underlying references when updating in place. – r.ook May 17 '20 at 20:05
  • So I still think the main issue resides with `to_numpy()`, since it's the universal key point in your example that can guarantee a true copy. Whether or not the underlying reference within the `index` is the same across copies becomes less important, because `pd.Index` handles that updating separately. But as soon as you move it away into `numpy` it handles it differently. e.g. `a1 = np.array([1,2,3]); a2= a1.copy(); a2+=1` would not impact `a1`. – r.ook May 17 '20 at 20:19
  • Thanks @r.ook. Good to know to think twice before using `to_numpy()`. Had the example used `ind = np.array(df.copy().index)` the issue would not be observed. – Hamid May 26 '20 at 17:53
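
The `+=` versus `ind = ind + 3` point raised in the comments can be sketched with plain NumPy arrays. The names below are illustrative only: `view` stands in for an array that shares its buffer with the original, which is roughly what a no-copy `to_numpy()` returns.

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0])
view = a            # second name for the same array
view += 3           # in-place: writes into the shared buffer
print(a)            # [3. 4. 5.]

b = np.array([0.0, 1.0, 2.0])
other = b
other = other + 3   # builds a new array and rebinds the name; `b` is untouched
print(b)            # [0. 1. 2.]
print(other)        # [3. 4. 5.]
```

Rebinding sidesteps the problem, but it does not remove the shared reference; any later in-place operation on a shared array would still propagate.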
0

Difference between Deep and Shallow Copy:

In a shallow copy, only the reference to the object gets copied, so any change to the original object or the copied object changes both.

In a deep copy, the entire object is copied along with the reference, hence any change to either object does not affect the other (i.e. they are independent objects).

Cases provided:

The first one is a shallow copy, and in a shallow copy the index gets altered; there is no guarantee of keeping the index intact:

ind = df.copy().index.to_numpy(): changing ind alters df

The second one copies the index, but by default the copy is shallow, not deep. Hence the index does not stay intact:

ind = df.index.copy().to_numpy(): changing ind alters df

The third one copies all the elements of the dataframe with a deep copy, but since it does not take the index into account, the index gets altered:

ind = df.copy(deep=True).index.to_numpy(): changing ind alters df

As for the last one, the index is part of the deep copy, hence it is copied completely and the original index stays intact. df has no relation to ind except that ind is a full copy of its index and exists independently:

ind = df.index.copy(deep=True).to_numpy(): changing ind does not alter df.

In the above cases, when you make a shallow copy, any change in either df or ind results in a change of the index. But with a deep copy that includes the index, you have two completely independent objects.
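
For the general shallow/deep distinction, here is a small sketch using Python's standard `copy` module on a plain dict rather than pandas. The dict and its keys are hypothetical stand-ins for a frame, and pandas applies its own rules when copying, so this only illustrates the concept.

```python
import copy

original = {"index": [0.0, 1.0, 2.0], "y": [0, 0, 0]}

shallow = copy.copy(original)     # new dict, but the lists inside are shared
deep = copy.deepcopy(original)    # new dict and new lists

shallow["index"][0] = 99.0        # mutates the list that `original` also holds

print(original["index"])          # [99.0, 1.0, 2.0]
print(deep["index"])              # [0.0, 1.0, 2.0]
```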

Umar Aftab
  • Documentation for [pandas.DataFrame.copy](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html) shows that the default is a deep copy. The documentation makes it sound like a deep copy will create a "copy of the object's indices and data". Does that conflict with the explanation of the first case? – Hamid May 15 '20 at 20:56
  • It does conflict, because based on the behavior you mentioned it is a shallow copy. A deep copy will take into account the indices – Umar Aftab May 15 '20 at 21:02
  • could have something to do with `to_numpy()` – Umar Aftab May 15 '20 at 21:04
  • Thanks for kicking off the deeper discussion into deep/shallow copies of pandas dataframe and index. – Hamid May 17 '20 at 17:28