122

This has been discussed before, but with conflicting answers:

What I'm wondering is:

  • Why is inplace = False the default behavior?
  • When is it good to change it? (well, I'm allowed to change it, so I guess there's a reason).
  • Is this a safety issue? That is, can an operation fail or misbehave due to inplace = True?
  • Can I know in advance if a certain inplace = True operation will "really" be carried out in-place?

My take so far:

  • Many Pandas operations have an inplace parameter, always defaulting to False, meaning the original DataFrame is untouched, and the operation returns a new DF.
  • When setting inplace = True, the operation might work on the original DF, but it might still work on a copy behind the scenes, and just reassign the reference when done.

pros of inplace = True:

  • Can be both faster and less memory hogging (the first link shows reset_index() runs twice as fast and uses half the peak memory!).

pros of inplace = False :

  • Allows chained/functional syntax: df.dropna().rename().sum()... which is nice, and offers a chance for lazy evaluation or a more efficient re-ordering (though I don't think Pandas is doing this).
  • When using inplace = True on an object which is potentially a slice/view of an underlying DF, Pandas has to do a SettingWithCopy check, which is expensive. inplace = False avoids this.
  • Consistent & predictable behavior behind the scenes.

So, putting the copy-vs-view issue aside, it seems more performant to always use inplace = True, unless specifically writing a chained statement. But that's not the default Pandas opts for, so what am I missing?

OmerB
  • 4
    My understanding is that this semantic follows numpy which is what pandas is built-on/modeled on. There isn't any performance gain to passing inplace=True versus self-assignment according to the devs (this was a comment on some question I can't find). Whether you're really working on a view or not is really tricky and error-prone, there isn't so far a fool-proof method, other than raising a warning where it's obvious and so `inplace=True` may not do what you expect – EdChum Aug 08 '17 at 14:35
  • @EdChum: _There isn't any performance gain to passing inplace=True versus self-assignment according to the devs_. Please update if you find the source (or if a Pandas dev can validate this...). In the link I posted they do show nice gains, and I'm sure you'll agree there's at least a **potential** for gains in this approach - at least in memory usage. – OmerB Aug 08 '17 at 17:04
  • Jeff Reback (one of the core pandas devs) commented on this on some question a while back, but I can't find a reference currently. Even regarding memory usage the difference is minimal, and irrespective of this, the potential erroneous situations that can arise make it difficult to mandate – EdChum Aug 08 '17 at 17:09
  • Ok, cool. Then it might be the second link in my question, the answer there is from a user called Jeff :-) – OmerB Aug 08 '17 at 17:11
  • 1
    Reset index would be faster inplace as the index object can be quickly replaced with a rangeindex, it's more assignment of columns and or data frames where the cost performance minimises. Also I'm answering this on my mobile whilst cooking so excuse the brevity – EdChum Aug 08 '17 at 17:11
  • Yep that's him, the design decisions have been carefully thought out, the common use cases are more like either self assignment or to calculate some result and do stuff with it, in those scenarios returning a copy is the more understandable and safer option and I agree – EdChum Aug 08 '17 at 17:15
  • 1
    I *STRONGLY DISAGREE* with whoever marked this as opinion-based. I think there are certainly cases where it is hard to argue one way or the other, but in this case it's open and shut - there are more cons than pros to the use of this argument to the extent that it is fast approaching "antipattern" status. If that isn't enough to convince you, its deprecation is also being planned. – cs95 Jul 13 '20 at 10:15
  • 1
    @OmerB just curious, is there a reason you've abstained from accepting an answer? Not that there's anything wrong with that - you're free to accept or not - but just wondering if there's something in the answers that is lacking. – cs95 Jul 14 '20 at 07:14
  • 1
    @cs95 - Yes, that's intentional. My post starts with a list of questions, most are still unanswered. An ideal answer would provide some background on why inplace was available to begin with, when is it useful (again - it was made available for some purpose) and explain the technical reasons why it sometimes does a copy. Also, I link in my question to one example of concrete performance gain, is it the only one? That ideal answer would map these cases out. I've been planning (for much too long now...) to revise the question, summarize the discussion from here and Github, and open for bounty. – OmerB Jul 14 '20 at 10:04

2 Answers

95

In pandas, is inplace = True considered harmful, or not?

Yes, it is. Not just harmful. Quite harmful. This GitHub issue is proposing the inplace argument be deprecated API-wide sometime in the near future. In a nutshell, here's everything wrong with the inplace argument:

  • inplace, contrary to what the name implies, often does not prevent copies from being created, and (almost) never offers any performance benefits
  • inplace does not work with method chaining
  • inplace can lead to the dreaded SettingWithCopyWarning when called on a DataFrame column, and may sometimes fail to update the column in-place

The pain points above are all common pitfalls for beginners, so removing this option will simplify the API greatly.


We take a look at the points above in more depth.

Performance
It is a common misconception that using inplace=True will lead to more efficient or optimized code. In general, there are no performance benefits to using inplace=True (but there are rare exceptions which are mostly a result of implementation detail in the library and should not be used as a crutch to advocate for this argument's usage). Most in-place and out-of-place versions of a method create a copy of the data anyway, with the in-place version automatically assigning the copy back. The copy cannot be avoided.
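As a minimal sketch of the "assigning the copy back" behavior described above (the sample data and variable names are illustrative assumptions, not from the linked discussions): the in-place call returns None and leaves the frame in the same state that plain reassignment produces. This only shows the equivalence of results; it does not measure speed or memory.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
original = pd.DataFrame({'a': [3.0, None, 1.0], 'b': ['x', 'y', 'z']})

# Out-of-place: dropna() returns a new frame, which is rebound to the name.
reassigned = original.copy()
reassigned = reassigned.dropna()

# In-place: dropna(inplace=True) returns None and updates the frame itself.
mutated = original.copy()
returned = mutated.dropna(inplace=True)

print(returned)                    # None
print(reassigned.equals(mutated))  # True
```

Either way the caller ends up with the same two-row frame, so the choice is about style and safety rather than about the result.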

Method Chaining
inplace=True also hinders method chaining. Contrast the working of

result = df.some_function1().reset_index().some_function2()

As opposed to

temp = df.some_function1()
temp.reset_index(inplace=True)
result = temp.some_function2()
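The placeholders some_function1 and some_function2 above are not real methods. A runnable version of the same contrast might look like the sketch below, which swaps in sort_values and rename purely as illustrative stand-ins.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

# Each method returns a new DataFrame, so the calls compose into one expression.
result = (
    df.sort_values('a')
      .reset_index(drop=True)
      .rename(columns={'b': 'label'})
)

# With inplace=True the call returns None, so there is nothing to chain onto.
temp = df.sort_values('a')
returned = temp.reset_index(drop=True, inplace=True)
print(returned)  # None
```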

Unintended Pitfalls
One final caveat to keep in mind is that calling a method with inplace=True can trigger the SettingWithCopyWarning:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning: 
# A value is trying to be set on a copy of a slice from a DataFrame

This can cause unexpected behavior: the replacement may be applied to a temporary copy, leaving the column in df2 unchanged.
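One common way to sidestep this, sketched here as a suggestion rather than as the only fix, is to take an explicit copy of the filtered frame and assign the result of replace back to the column.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

# Take an explicit copy so df2 is unambiguously its own object.
df2 = df[df['a'] > 1].copy()

# Assign the result back instead of calling replace(..., inplace=True)
# on the selected column.
df2['b'] = df2['b'].replace({'x': 'abc'})

print(df2['b'].tolist())  # ['abc', 'y']
print(df['b'].tolist())   # ['x', 'y', 'z']
```

The original df is untouched, and there is no ambiguity about which object received the update.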

cs95
  • 14
    It hinders method chaining because the API returns None when `inplace` is True. But that's an API design decision, not an inherent limitation of in-place operations. – NicholasM Feb 01 '20 at 19:37
  • 2
    @NicholasM It's a bad API design decision, which is why it is considered for deprecation. – cs95 Feb 01 '20 at 20:40
  • 6
    chained and in place operations are inherently different approaches- you can argue for one or the other but the fact that they are incompatible with each other is not really a weakness of either (edit: well I suppose you could design the API to do the operation in place AND return a pointer to the input. I guess I see what you're saying) – avigil Feb 11 '20 at 04:40
  • 3
    FWIW the Github issue has been open since May 2017 with 41 comments, so it doesn't seem to be going anywhere fast – Addison Klinke Mar 24 '21 at 16:17
  • @NicholasM Yes, it is an API design decision, but that's exactly what `inplace` is. Do you want to return the resulting df (`inplace=False`), or do you want to return `None` and edit the df inplace (`inplace=True`). What are you proposing, that `inplace` both alters the dataframe and returns it? This means if you chain methods onto the df, you'll be creating another copy of the df on each chained method. – Mike Williamson Dec 06 '22 at 18:07
11

If inplace were the default, then the DataFrame would be mutated for all names that currently reference it.

A simple example, say I have a df:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

Now it's very important that the DataFrame retains that row order - let's say it's from a data source where insertion order is key, for instance.

However, I now need to do some operations which require a different sort order:

def f(frame):
    df = frame.sort_values('a')
    # if we did frame.sort_values('a', inplace=True) here without
    # making it explicit - our caller is going to wonder what happened
    # do something
    return df

That's fine - my original df remains the same. However, if inplace=True were the default, then my original df would now be sorted as a side effect of f(), and whoever writes f() would have to remember to opt out of in-place behavior instead of deliberately opting in. So it's better that anything that can mutate an object in place does so explicitly, to at least make it more obvious what happened and why.
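To make the side effect concrete, here is a small sketch with two hypothetical variants of f (the names f_copy and f_inplace are made up for illustration): one returns a sorted copy, the other sorts the caller's frame.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})

def f_copy(frame):
    # Returns a sorted copy; the caller's frame is left alone.
    return frame.sort_values('a')

def f_inplace(frame):
    # Hypothetical variant: sorts the caller's frame as a side effect.
    frame.sort_values('a', inplace=True)
    return frame

f_copy(df)
order_after_copy = df['a'].tolist()     # [3, 2, 1]

f_inplace(df)
order_after_inplace = df['a'].tolist()  # [1, 2, 3]
```

After f_inplace runs, every name that refers to the same DataFrame sees the new row order.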

Even with basic Python builtin mutables, you can observe this:

data = [3, 2, 1]

def f(lst):
    lst.sort()
    # I meant lst = sorted(lst)
    for item in lst:
        print(item)

f(data)

for item in data:
    print(item)

# huh!? What happened to my data - why's it not 3, 2, 1?     
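For comparison, a sketch of the version hinted at in the comment above, using sorted() so the caller's list keeps its order (f_fixed is a hypothetical name for the corrected function):

```python
data = [3, 2, 1]

def f_fixed(lst):
    # sorted() builds a new list; rebinding the local name
    # does not affect the caller's list.
    lst = sorted(lst)
    return lst

sorted_copy = f_fixed(data)
print(sorted_copy)  # [1, 2, 3]
print(data)         # [3, 2, 1]
```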
Jon Clements
  • 2
    The python referencing system is an important point to raise, personally I don't have an issue with the current method of working but from what I remember this behaviour is the same as numpy and so it just follows this semantic +1 – EdChum Aug 08 '17 at 14:38
  • @Jon - that's true, but I'm not yet convinced. They could have simply made DFs immutable, but they didn't. Besides, expanding on your Python example - lists are far more ubiquitous than DataFrames, and still Python decided to make them mutable and count on the caller to be responsible, because there are gains for mutating them in-place. – OmerB Aug 08 '17 at 17:09
  • 2
    @OmerB immutable dataframes would be impractical for their use. In short I'm saying that having inplace be explicit you're leaving it to the developer to explicitly say "I know what I'm doing and I'm aware of the consequences of the scope this may impact". Which is more sensible than the reverse and having to know you should provide an option to stop other things potentially breaking. – Jon Clements Aug 08 '17 at 17:15