The simple answer is that the culprit is to_numpy(). From its documentation (emphasis mine):
copy: bool, default False
Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
>>> ind = df.copy().index.to_numpy(copy=True)
>>> ind
array([0., 1., 2.])
>>> df
y
0.0 0
1.0 0
2.0 0
>>> ind += 3
>>> df
y
0.0 0
1.0 0
2.0 0
>>> ind
array([3., 4., 5.])
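The same point can be checked without mutating anything. This is a minimal sketch that assumes a stand-in df (float index, one column of zeros) built to resemble the one in the question. It uses np.shares_memory to ask whether each array still points at the index's buffer.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the question's df: float index, one column of zeros.
df = pd.DataFrame({"y": [0, 0, 0]}, index=[0.0, 1.0, 2.0])

default_result = df.index.to_numpy()        # copy=False is the default
real_copy = df.index.to_numpy(copy=True)    # a copy is forced

# Does each array overlap the memory that backs the index?
shares_default = np.shares_memory(default_result, df.index.values)
shares_copy = np.shares_memory(real_copy, df.index.values)
print("default shares memory with index:  ", shares_default)
print("copy=True shares memory with index:", shares_copy)

# Modifying the forced copy leaves the index alone.
real_copy += 3
print(df.index.tolist())
```

Only the copy=True array is safe to modify in place. Depending on the pandas version, the default result is either the index's own buffer or a read-only view of it.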
Since to_numpy uses np.asarray, it's worthwhile to make note of this bit as well (emphasis mine):
out : ndarray
Array interpretation of a. No copy is performed if the input is already an ndarray with matching dtype and order. If a is a subclass of ndarray, a base class ndarray is returned.
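To see that np.asarray behavior in isolation, here is a small NumPy-only sketch. The array and variable names are made up for illustration.

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0])

same = np.asarray(a)                        # already an ndarray: handed back as-is
converted = np.asarray(a, dtype=np.int64)   # dtype differs: a new array is built
forced = np.array(a)                        # np.array copies by default

print(same is a)
print(np.shares_memory(converted, a))
print(np.shares_memory(forced, a))
```

The first call returns the very same object, which is why an in-place operation on the result also shows up in the input.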
The deeper answer is: the index's underlying data buffer is carried over unless a true copy is explicitly made on the index itself, not on the df. Observe this test:
tests = '''df.index
df.copy().index
df.index.copy()
df.copy(deep=True).index
df.index.copy(deep=True)'''
print('Underlying object reference test...')
for test in tests.split('\n'):
    # !!! Do as I say not as I do !!!
    # !!! eval will ruin your life !!!
    print(f'{"{:54}".format(f"With {test} is:")}{eval(test).values.__array_interface__["data"]}')
    print(f'{"{:54}".format(f"With {test}.to_numpy() is:")}{eval(test).to_numpy().__array_interface__["data"]}')
    print(f'{"{:54}".format(f"With {test}.to_numpy(copy=True) is:")}{eval(test).to_numpy(copy=True).__array_interface__["data"]}')
Results:
Underlying object reference test...
With df.index is: (61075440, False) # <-- reference to watch for
With df.index.to_numpy() is: (61075440, False) # same as df.index
With df.index.to_numpy(copy=True) is: (61075504, False) # True copy
With df.copy().index is: (61075440, False) # same as df.index
With df.copy().index.to_numpy() is: (61075440, False) # same as df.index
With df.copy().index.to_numpy(copy=True) is: (61075504, False) # True copy
With df.index.copy() is: (61075440, False) # same as df.index
With df.index.copy().to_numpy() is: (61075440, False) # same as df.index
With df.index.copy().to_numpy(copy=True) is: (61075504, False) # True copy
With df.copy(deep=True).index is: (61075440, False) # same as df.index
With df.copy(deep=True).index.to_numpy() is: (61075440, False) # same as df.index
With df.copy(deep=True).index.to_numpy(copy=True) is: (61075504, False) # True copy
With df.index.copy(deep=True) is: (61075504, False) # True copy
With df.index.copy(deep=True).to_numpy() is: (61075504, False) # True copy
With df.index.copy(deep=True).to_numpy(copy=True) is: (61075472, False) # True copy of True copy
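As you can see, unless an explicit true copy is made on the index directly (df.index.copy(deep=True)) or in the to_numpy call (copy=True), the array you get back still points at the original index data, so any in-place change to it inadvertently changes your existing index.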
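The same comparison can be run without eval and without reading raw addresses. This is a rough sketch, not the test above: it assumes a stand-in df shaped like the one in the question, swaps the eval strings for lambdas, and uses np.shares_memory in place of the __array_interface__ check.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the question's df.
df = pd.DataFrame({"y": [0, 0, 0]}, index=[0.0, 1.0, 2.0])
original = df.index.values

# Same five expressions as the test strings, written as callables instead of eval.
candidates = {
    "df.index": lambda: df.index,
    "df.copy().index": lambda: df.copy().index,
    "df.index.copy()": lambda: df.index.copy(),
    "df.copy(deep=True).index": lambda: df.copy(deep=True).index,
    "df.index.copy(deep=True)": lambda: df.index.copy(deep=True),
}

results = {}
for label, make in candidates.items():
    idx = make()
    results[label] = (
        np.shares_memory(idx.to_numpy(), original),           # default copy=False
        np.shares_memory(idx.to_numpy(copy=True), original),  # forced copy
    )
    print(f"{label:<26} default shares: {results[label][0]!s:<6} copy=True shares: {results[label][1]}")
```

A True in the first column marks an expression whose to_numpy() result still overlaps the original index data.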
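As to why the true copies report the same address as each other (except the true copy of a true copy), I don't have a full appreciation of what's happening under the hood. My best guess is that the copies do not actually share memory: each temporary copy is freed as soon as its line has been printed, so the allocator is free to hand the same address to the next copy. The "true copy of a true copy" gets a different address because the first copy is still alive while the second one is made. That, however, is probably for another question.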
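One way to probe the repeated addresses is to keep every copy alive while comparing. This is a sketch under assumptions: the df is a stand-in for the one in the question and data_address is a hypothetical helper. If the repeats come from freed memory being reused, copies that are all still referenced must sit at different addresses.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the question's df.
df = pd.DataFrame({"y": [0, 0, 0]}, index=[0.0, 1.0, 2.0])

def data_address(arr):
    """Hypothetical helper: start address of an array's data buffer."""
    return arr.__array_interface__["data"][0]

# Hold a reference to every copy so none of them can be freed early.
kept = [df.index.to_numpy(copy=True) for _ in range(3)]
addresses = [data_address(arr) for arr in kept]

print("original:", data_address(df.index.values))
print("copies:  ", addresses)
```

Because all three copies are alive at once, each has its own buffer, and none of them sits at the original index's address.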