Setting values on Pandas DataFrame subset (copy) is slow

Question

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 10))

dft = df[[True, False] * 5]
# df = dft
dft2 = dft.copy()

new_data = np.random.rand(5, 10)

print(timeit.timeit('dft.loc[:, :] = new_data', setup='from __main__ import dft, new_data', number=100))
print(timeit.timeit('dft2.loc[:, :] = new_data', setup='from __main__ import dft2, new_data', number=100))

On my laptop, setting values in dft (the original subset) is about 160 times slower than setting values in dft2 (a deep copy of dft).

Why is this the case?

Edit: Removed speculation about proxy objects.

As c. leather suggests, this is likely because of a different code path when setting values on a DataFrame that pandas has flagged as a copy (dft) vs. one it treats as an original (dft2).

Bonus question: removing the reference to the original DataFrame df (by uncommenting the df = dft line) cuts the slowdown factor to roughly 2 on my laptop. Any idea why this is the case?
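The two measurements described above can be put side by side in one script. This is only a sketch: `time_assignment` is a hypothetical helper name, `del df` stands in for the `df = dft` rebinding from the question, and the ratios it prints will vary with the machine and the pandas version.

```python
import timeit
import warnings

import numpy as np
import pandas as pd


def time_assignment(frame, values, number=100):
    """Return the seconds taken by `number` whole-frame .loc assignments."""

    def assign():
        frame.loc[:, :] = values

    with warnings.catch_warnings():
        # Some pandas versions may warn when assigning to a subset;
        # silence that here to keep the output readable.
        warnings.simplefilter("ignore")
        return timeit.timeit(assign, number=number)


df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]
dft2 = dft.copy()
new_data = np.random.rand(5, 10)

t_subset = time_assignment(dft, new_data)
t_copy = time_assignment(dft2, new_data)

# Drop the only name bound to the parent frame, then time the subset again.
del df
t_subset_no_parent = time_assignment(dft, new_data)

print(f"subset / copy:             {t_subset / t_copy:.1f}x")
print(f"subset (no parent) / copy: {t_subset_no_parent / t_copy:.1f}x")
```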

Under the hood, `df[[True, False] * 5]` calls `DataFrame.__getitem__()`, which calls `DataFrame._getitem_array()` when the indexer is a list. This in turn calls `DataFrame.take()`, which has an is_copy parameter. I've found that `df.take([0,2,4,6,8], is_copy=True)` is slower than `df.take([0,2,4,6,8], is_copy=False)`: is_copy=True gives the same runtime as dft in your example, and is_copy=False gives the same runtime as dft2. So the slowdown arises somewhere down the line because of this is_copy setting, perhaps during `DataFrame.__setitem__`. — c. leather, Jul 08 '16 at 00:50
What the is_copy setting is actually used for, however, is pretty murky, and it will probably take some digging in `__setitem__`. I think your feeling about the returned array being a view/proxy is a good one, and I think it has to do with this setting. — c. leather, Jul 08 '16 at 00:52
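One way to see whether a frame carries this kind of marker is to look for it directly. The sketch below assumes the marker is stored in a private `_is_copy` attribute, which is an internal detail rather than public API and may not exist in every pandas release, so it is read with `getattr` and a default. `has_copy_marker` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def has_copy_marker(frame):
    """Return True if the frame carries a non-None `_is_copy` marker.

    `_is_copy` is assumed to be a private pandas attribute; if it is
    missing in the installed version, this simply returns False.
    """
    return getattr(frame, "_is_copy", None) is not None


df = pd.DataFrame(np.random.rand(10, 10))
dft = df[[True, False] * 5]   # subset taken from df
dft2 = dft.copy()             # explicit copy of the subset

print("dft marked as copy: ", has_copy_marker(dft))
print("dft2 marked as copy:", has_copy_marker(dft2))
```

If the two frames report different values, that would be consistent with the idea that the marker, not the data, is what separates the slow path from the fast one.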

score 5 · Answer 1 · edited May 23 '17 at 11:58

This is not exactly a new question on SO. This and this are related posts, and this is the link to the current docs that explain it.

The comments from @c.leather are on the right track. The problem is that dft may be a view rather than a copy of the DataFrame df, as explained in the linked articles. Pandas cannot know whether it really is a copy or not, or whether the operation is safe, so it runs a lot of checks to ensure that the assignment is safe to perform. Those checks can be avoided by simply making a copy.

This is a pertinent issue, and there is a whole discussion on GitHub. I've seen a lot of suggestions; the one I like the most is that the docs should encourage the df[[True, False] * 5].copy() idiom, which one might call the slice & copy idiom.
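A minimal sketch of the slice & copy idiom, using the same shapes as the question. The names `mask`, `subset` and `original` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10))
original = df.copy()          # kept only to show that df is not modified
mask = [True, False] * 5

# Slice & copy: take the subset and copy it right away, so later
# assignments work on an independent frame.
subset = df[mask].copy()

new_data = np.random.rand(5, 10)
subset.loc[:, :] = new_data

print(subset.shape)
print(df.equals(original))
```

The explicit copy costs one extra allocation up front, but every later assignment goes to a frame that has no link back to df.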

I could not find the exact checks, and on the GitHub issue this performance nuance is only mentioned through some tweets that a few developers posted noting the behavior. Maybe someone more involved in pandas development can add more input.

The question isn't about view vs. copy, it's about the reason for the speed difference. I think my speculation about proxy objects is misleading (and am striking it out). Thanks for the links to the github page! — Alex, Jul 13 '16 at 19:24
