
I have reported this on the pandas issue tracker. In the meantime, I am posting it here in the hope of saving others time if they run into something similar.

While profiling a process that needed optimizing, I found that renaming columns without inplace=True reduced execution time by a factor of 120. Profiling indicates that this is related to garbage collection (see below).

Furthermore, the expected performance is recovered by avoiding the dropna method.

The following short example shows a factor of about 12:

import pandas as pd
import numpy as np

inplace=True

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

100 loops, best of 3: 15.6 ms per loop

First output line of %%prun:

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.018    0.018    0.018    0.018  {gc.collect}

inplace=False

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
df = (df1-df2).dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 1.24 ms per loop

avoid dropna

The expected performance is recovered by avoiding the dropna method:

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
#no dropna:
df = (df1-df2)#.dropna()
## inplace rename:
df.rename(columns={col:'d{}'.format(col) for col in df.columns}, inplace=True)

1000 loops, best of 3: 865 µs per loop

%%timeit
np.random.seed(0)
r,c = (7,3)
t = np.random.rand(r)
df1 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
indx = np.random.choice(range(r), r//3, replace=False)  # r//3: the size must be an integer
t[indx] = np.random.rand(len(indx))
df2 = pd.DataFrame(np.random.rand(r,c), columns=range(c), index=t)
## no dropna
df = (df1-df2)#.dropna()
## avoid inplace:
df = df.rename(columns={col:'d{}'.format(col) for col in df.columns})

1000 loops, best of 3: 902 µs per loop
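For anyone who wants to rerun the four cells above outside IPython, here is a minimal sketch using the standard timeit module. The helper names (build_diff, rename_cols, time_variant) are my own and not part of pandas, and the timings will depend heavily on the pandas version, so the numbers quoted above may not reproduce.

```python
import timeit

import numpy as np
import pandas as pd


def build_diff(dropna=True):
    """Rebuild the frame from the cells above (hypothetical helper)."""
    np.random.seed(0)
    r, c = (7, 3)
    t = np.random.rand(r)
    df1 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    indx = np.random.choice(range(r), r // 3, replace=False)
    t[indx] = np.random.rand(len(indx))
    df2 = pd.DataFrame(np.random.rand(r, c), columns=range(c), index=t)
    diff = df1 - df2
    return diff.dropna() if dropna else diff


def rename_cols(df, inplace):
    """Rename every column to 'd<col>', either in place or by reassignment."""
    mapping = {col: 'd{}'.format(col) for col in df.columns}
    if inplace:
        df.rename(columns=mapping, inplace=True)
        return df
    return df.rename(columns=mapping)


def time_variant(dropna, inplace, number=50):
    """Average seconds per run for one dropna/inplace combination."""
    total = timeit.timeit(lambda: rename_cols(build_diff(dropna), inplace),
                          number=number)
    return total / number


for dropna in (True, False):
    for inplace in (True, False):
        secs = time_variant(dropna, inplace)
        print('dropna={!s:5} inplace={!s:5} {:.3f} ms per loop'.format(
            dropna, inplace, secs * 1e3))
```

Note that each timed run includes building the frames, exactly as the %%timeit cells do, so only the differences between the four variants are meaningful.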

eldad-a
1 Answer


This is a copy of the explanation on GitHub.

There is no guarantee that an inplace operation is actually faster. Often it is the same operation performed on a copy, with the top-level reference then reassigned.
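As a small illustration of that point (toy data of my own, not taken from the issue), the two spellings of rename end up with the same frame. The only visible difference is that the inplace call returns None and mutates its target, while the other returns a new object that you rebind yourself.

```python
import pandas as pd

df_a = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
df_b = df_a.copy()
mapping = {col: 'd{}'.format(col) for col in df_a.columns}

# inplace=True: nothing is returned, df_a itself is modified
ret = df_a.rename(columns=mapping, inplace=True)

# default: a new frame is returned and the name df_b is rebound to it
df_b = df_b.rename(columns=mapping)

print(ret is None)
print(df_a.equals(df_b))
```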

The reason for the difference in performance in this case is as follows.

The (df1-df2).dropna() call creates a slice of the dataframe. When you then apply another operation to that slice, pandas runs a SettingWithCopy check, because the slice could be a copy (though often it is not).

To decide whether it is a copy, this check has to run a garbage collection to clear some cached references. Unfortunately, Python syntax makes this unavoidable.

You can avoid this by simply making a copy first:

df = (df1-df2).dropna().copy()

An inplace operation after this copy will be as performant as before.
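Put together, the workaround looks roughly like this. It is a sketch on made-up data: the two indexes share five of their seven labels so that dropna actually drops something, and whether the copy makes any measurable difference depends on your pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
idx1 = rng.rand(7)
idx2 = idx1.copy()
idx2[:2] = rng.rand(2)  # two labels differ, so alignment produces NaN rows

df1 = pd.DataFrame(rng.rand(7, 3), index=idx1)
df2 = pd.DataFrame(rng.rand(7, 3), index=idx2)

# explicit copy first, so the result is no longer treated as a possible slice
df = (df1 - df2).dropna().copy()

# the inplace rename now operates on a frame that is known to be independent
df.rename(columns={col: 'd{}'.format(col) for col in df.columns}, inplace=True)
print(df.columns.tolist())
```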

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

iff_or
Jeff
  • 1
    "I never use in-place operations. The syntax is harder to read and it does not offer any advantages." Interesting point. I should consider this in the future. The `.copy()` suggestion indeed solves the issue. Thanks for your detailed and prompt reply! – eldad-a Mar 20 '14 at 13:15
  • 8
    The reason I say this is that the core of pandas operations is chaining, where each operation returns a copy, e.g. ``df.dropna().rename(....).sum()`` is very intuitive / readable. When you inject an inplace operation you cannot chain. – Jeff Mar 20 '14 at 13:17
  • 13
    I wouldn't say that the syntax doesn't offer any advantages-- it allows you to avoid putting a long specification on both sides of the equal sign. It's a variant of the advantage that `some_long_complicated_expression[some:long_slice, more_information_here] += 1` has over `some_long_complicated_expression[some:long_slice, more_information_here] = some_long_complicated_expression[some:long_slice, more_information_here] + 1`. – DSM Mar 20 '14 at 14:24
  • 1
    @DSM fair point; I usually just use a temporary variable, say ``mask``, where the meaning is clear (though in your example it's actually not needed on the rhs, as the frame will be aligned, e.g. you can simply use ``some_long_complicated_expression + 1``, though there may be a perf impact). – Jeff Mar 20 '14 at 14:26
  • 1
    Not arguing the overall point, just trying to ask a probably naive question: when you say, "The syntax is harder to read and it does not offer any advantages," if it really did something in place and the frame was huge, would the memory efficiency not be a positive? Assuming local operations? – JimLohse Nov 01 '16 at 05:28
  • copy is so cheap that any perceived gains are illusory. syntax wins above all – Jeff Nov 01 '16 at 09:40
  • 1
    usually, but copy is not cheap when a df has billions of rows.. although `.copy(deep=False)` is still cheap; just make sure it's doing what you want in context – user2561747 Sep 07 '19 at 02:02
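
To make the chaining argument from the comments concrete, here is a short sketch on toy data of my own. Each method returns a new object, so the calls compose into a single expression, which an inplace call (returning None) would break.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [4.0, 5.0, 6.0]})

# drop the row containing NaN, rename the columns, then sum each column
totals = (
    df.dropna()
      .rename(columns={col: 'd{}'.format(col) for col in df.columns})
      .sum()
)
print(totals)
```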