Surprisingly slow access to elements of DataFrame

Question

Consider the following example:

import numpy as np
import pandas as pd

x = {f"b{k}":10000*[None] for k in range(10)}
for column in x.keys():
    for i in range(10000):
        x[column][i] = i % 4 # some minor calculation here
x = pd.DataFrame(x)

It finishes in about 60 ms on my computer. A similar thing can be done directly on a pandas DataFrame:

x = pd.DataFrame(index=range(10000), columns=[f"b{i}" for i in range(10)], dtype=np.int32)
for column in x.columns:
    for i in range(10000):
        x.loc[i, column] = i % 4 # some minor calculation here

This snippet finishes in about 11.8s. Can this kind of operation be done directly on the DataFrame without a significant performance hit?

Update:

Using .at instead of .loc, as suggested in the pandas docs, improves things considerably, but it is still significantly slower than Python's dict.

x = pd.DataFrame(index=range(10000), columns=[f"b{i}" for i in range(10)], dtype=np.int32)
for column in x.columns:
    for i in range(10000):
        x.at[i, column] = i % 4

This finishes in about 1.2s.

Are you trying to do `x % 4`? You can read coldspeed's answer here, it's pretty good: https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200 — Bharath M Shetty, Jan 25 '21 at 16:12
Rule of thumb: looping over a DataFrame is generally slow; use vectorized functions when possible. — Quang Hoang, Jan 25 '21 at 16:23
@Bharath `i % 4` is just a dummy operation to drive home the point that the calculation is not the issue. The link seems to address the cost of calculating the value itself. The problem in this question seems to be related to single-item access. @QuangHoang Isn't this just too slow? — tomas789, Jan 25 '21 at 16:25

juanpa.arrivillaga · Answer 1 · 2021-01-25T17:36:28.410

This is to be expected with pandas. Single-element operations like this are bound to be slow, because you have to pass through several layers of Python-level function calls before you even begin to do the actual "work". Here's the source code for .at:

return _AtIndexer("at", self)

Note that it's part of a mixin, which already slows things down. More importantly, it returns an instance of _AtIndexer, which is yet another object.

Only then do you finally get to a __getitem__ method, but even then, look at what it needs to do:

def __getitem__(self, key):

    if self.ndim == 2 and not self._axes_are_unique:
        # GH#33041 fall back to .loc
        if not isinstance(key, tuple) or not all(is_scalar(x) for x in key):
            raise ValueError("Invalid call for scalar access (getting)!")
        return self.obj.loc[key]

    return super().__getitem__(key)

So it does a Python-level conditional, but the standard case is actually just

return super().__getitem__(key)

So, another Python-level function call. What does it do? Here's the source code:

def __getitem__(self, key):
    if not isinstance(key, tuple):

        # we could have a convertible item here (e.g. Timestamp)
        if not is_list_like_indexer(key):
            key = (key,)
        else:
            raise ValueError("Invalid call for scalar access (getting)!")

    key = self._convert_key(key)
    return self.obj._get_value(*key, takeable=self._takeable)

So, even more Python-level conditionals, plus some other Python-level method calls...

We could keep digging to see what self.obj._get_value does, which is implemented in yet another base class, but I think you get the point by now.
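The layering described above can be made visible by counting Python-level function calls. This is a minimal sketch, not a benchmark: count_python_calls is a hypothetical helper built on sys.setprofile, and the exact count for pandas will depend on the installed version.

```python
import sys

import pandas as pd


def count_python_calls(func):
    """Count Python-level function calls made while func() runs.

    Hypothetical helper for illustration. C-level calls are reported as
    "c_call" events and are deliberately not counted.
    """
    calls = 0

    def profiler(frame, event, arg):
        nonlocal calls
        if event == "call":
            calls += 1

    sys.setprofile(profiler)
    try:
        func()
    finally:
        sys.setprofile(None)
    return calls


data = [0] * 10
df = pd.DataFrame({"b0": [0] * 10})


def set_list_item():
    data[0] = 1  # item assignment on a list is handled in C


def set_frame_item():
    df.at[0, "b0"] = 1  # passes through the .at property and the indexer object


list_calls = count_python_calls(set_list_item)
frame_calls = count_python_calls(set_frame_item)
print(f"list: {list_calls} Python-level call(s), DataFrame.at: {frame_calls}")
```

The list assignment should register little more than the call to set_list_item itself, while the .at assignment registers noticeably more Python-level calls.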

In a list/dict, once you get past the initial method resolution, you are in the C layer, and it does all the work there. In pandas, you incur a ton of overhead in the Python layer before the work gets pushed, eventually, hopefully, into NumPy, where the speed of the bulk operations comes from. It has no hope of beating operations on a built-in Python data structure, though, which are much thinner wrappers around compiled code. The performance of pandas dies a death by a thousand cuts: various method calls and internal bookkeeping logic done in Python.
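One way to act on this, offered as a sketch under assumptions rather than something taken from the pandas source: do the per-element work on a plain NumPy array, or vectorize it entirely, and wrap the result in a DataFrame once. The names n_rows, n_cols, and values are illustrative, and i % 4 stands in for the question's dummy calculation.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 10_000, 10
columns = [f"b{k}" for k in range(n_cols)]

# Option 1: keep the element-wise loop, but run it on a NumPy array,
# then build the DataFrame a single time at the end.
values = np.empty((n_rows, n_cols), dtype=np.int32)
for j in range(n_cols):
    for i in range(n_rows):
        values[i, j] = i % 4  # placeholder for the real per-element calculation
looped = pd.DataFrame(values, columns=columns)

# Option 2: if the calculation can be expressed on whole arrays,
# skip the Python loop altogether.
row_values = (np.arange(n_rows) % 4).astype(np.int32)
vectorized = pd.DataFrame({name: row_values for name in columns})
```

Both options pay the pandas overhead once per DataFrame instead of once per element. Option 2 only applies when the calculation can be written as array operations.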

EDIT: I noticed that I actually went through the __getitem__ logic, but the point still stands for __setitem__. Indeed, the intermediate steps there seem to involve even more work.
