import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]  # boolean row mask: keeps every other row (5 of 10)
# df = dft
dft2 = dft.copy()  # deep copy of the subset
new_data = np.random.rand(5, 10)
print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))
On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).
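For anyone trying to reproduce the ratio, here is a minimal sketch of a less noise-sensitive measurement. It passes a callable to timeit.repeat and keeps the best run, instead of importing names from __main__. The helper name best_time and its repeat/number defaults are my own choices for illustration, and the ratio will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]   # boolean row mask: keeps rows 0, 2, 4, 6, 8
dft2 = dft.copy()             # independent deep copy of the subset
new_data = np.random.rand(5, 10)


def best_time(frame, repeat=5, number=100):
    """Hypothetical helper: best-of-`repeat` total time for `number` assignments."""
    def assign():
        frame.loc[:, :] = new_data

    # timeit.repeat returns one total time per repetition; take the minimum.
    return min(timeit.repeat(assign, repeat=repeat, number=number))


t_subset = best_time(dft)
t_copy = best_time(dft2)
print(f"subset: {t_subset:.4f}s  copy: {t_copy:.4f}s  ratio: {t_subset / t_copy:.1f}x")
```

Depending on the pandas version, the assignment to dft may also emit a SettingWithCopyWarning; that does not stop the timing from running.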
Why is this the case?
Edit: Removed speculation about proxy objects.
As c. leather suggests, this is likely because pandas takes a different code path when setting values on a copy (dft) than on an original DataFrame (dft2).
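One way to check the code-path idea is to profile both assignments and compare which functions dominate. The sketch below uses only cProfile and pstats from the standard library. The helper name profile_assignment and the choice to show the top 10 entries by cumulative time are assumptions for illustration, and I am not claiming what the reports will contain on any particular pandas version.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]   # subset taken from df
dft2 = dft.copy()             # deep copy of the subset
new_data = np.random.rand(5, 10)


def profile_assignment(frame, number=100, top=10):
    """Hypothetical helper: profile repeated .loc assignment, return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(number):
        frame.loc[:, :] = new_data
    profiler.disable()

    # Write the report into a string buffer, sorted by cumulative time.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report_subset = profile_assignment(dft)
report_copy = profile_assignment(dft2)
print(report_subset)
print(report_copy)
```

Functions that appear near the top of the dft report but not the dft2 report would be the first place to look.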
Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the slowdown factor to roughly 2 on my laptop. Any idea why this is the case?