
Pandas operations usually create a copy of the original dataframe. As some answers on SO point out, even when using inplace=True, a lot of operations still create a copy to operate on.

Now, I think I'd be called a madman if I told my colleagues that every time I want to, for example, add 2 to every element of a list, I copy the whole thing before doing it. Yet that's what Pandas does. Even simple operations such as append always reallocate the whole dataframe.

Having to reallocate and copy everything on every operation seems like a very inefficient way to go about operating on any data. It also makes operating on particularly large dataframes impossible, even if they fit in your RAM.
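
To make the comparison concrete, here is a minimal sketch of the behaviour described above. The frame and the column name `a` are made up for illustration; it only shows that plain arithmetic on a dataframe hands back a new object with its own buffer, while the list is updated in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3]})

# Arithmetic on a DataFrame returns a new DataFrame.
out = df + 2
is_same_object = out is df
shares_buffer = np.shares_memory(df["a"].to_numpy(), out["a"].to_numpy())

print(is_same_object)      # False: a new object was created
print(shares_buffer)       # False: the result has its own memory
print(df["a"].tolist())    # [1, 2, 3], the original is untouched

# The plain-list equivalent, updated element by element in place.
values = [1, 2, 3]
same_list = values
for i in range(len(values)):
    values[i] += 2

print(same_list)           # [3, 4, 5], no second list was allocated
```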

Furthermore, this does not seem to bother Pandas developers or users, so much so that there's an open issue, #16529, discussing the removal of the inplace parameter entirely, which has received mostly positive responses; some uses of inplace have already been deprecated since 1.0. It seems like I'm missing something. So, what am I missing?

What are the advantages of always copying the dataframe on operations, instead of executing them in-place whenever possible?

Note: I agree that method chaining is very neat, and I use it all the time. However, I feel that "because we can method chain" is not the whole answer, since Pandas sometimes copies even in inplace=True methods, which are not meant to be chained. So, I'm looking for other answers as to why this would be a reasonable default.

smci
Luiz Martins
  • 7
    So, as the issue about removing `inplace` mentions, the reason it's being removed is that it is a misnomer. It _does_ create a copy; it just hides away the reassignment. There is almost no difference between `df = df.some_operation()` and `df.some_operation(inplace=True)`. There are (almost) no true inplace operations. In my opinion, this question is a great reason for _removing_ the `inplace` parameter, because it makes people _think_ they're not making copies when they are. – Henry Ecker Nov 15 '21 at 04:43
  • 3
    "inplace does not generally do anything inplace but makes a copy and reassigns the pointer" https://github.com/pandas-dev/pandas/issues/16529#issuecomment-323890422 and "there are __absolutely no performance benefits__ to using `inplace=True`" from [the linked answer](https://stackoverflow.com/a/59242208/15497888) by [cs95](https://stackoverflow.com/users/4909087/cs95) – Henry Ecker Nov 15 '21 at 04:44
  • 1
    @HenryEcker that isn't true. There's a big difference: `df = df.some_operation()` is not the same as `df.some_operation(inplace=True)`, because with the latter, *every other place the dataframe is referred to sees the change; with the former, it doesn't*. Of course, the *underlying buffer* may or may not be re-allocated. – juanpa.arrivillaga Nov 15 '21 at 04:53
  • 9
    I don't really know if this question is answerable in some ways... We could get into how Pandas DataFrames actually store data and how individual block managers are used to group collections of variables of the same type into ndarrays by dtype and how it is not reasonable to reshape a DataFrame without rebuilding the block managers to structure them in the correct sequence to reduce total overall memory and fragmentation. But these are largely design decisions and the library _could_ have been designed differently... – Henry Ecker Nov 15 '21 at 04:54
  • 1
    @juanpa.arrivillaga Okay. Fair enough. What I meant was there is almost no difference in the number of copies or memory needed to do an "inplace" operation vs a not "inplace" operation given the context of pandas and standard use cases. (not considering the actual program logic that may or may not apply to overwriting the `self` of a class instance like the inplace operations do) – Henry Ecker Nov 15 '21 at 04:56
  • 1
    @HenryEcker Indeed they are design decisions, but surely they have good reasons to choose this specific design. I imagine that there are some major advantages to their implementation, such that being "forced" to copy on every operation is worth it. In that case, I'm still interested in what design would lead to a library being forced to do that, and what you gain from it. – Luiz Martins Nov 15 '21 at 05:08
  • @LuizMartins, the link shared by HenryEcker is a long discussion on the inplace parameter, and might shed more light on Pandas design choices – sammywemmy Nov 15 '21 at 05:42
  • No, it's not efficient for computers, but humans tend to make fewer mistakes with this design, IMHO. Especially in an interactive data analysis environment, where you frequently go back and forth between different cells or abort a long computation, in-place mutation can cause bugs that are hard to reason about. That cost is higher than the slight inefficiency incurred on computers, which is usually a difference of milliseconds. – Alby Dec 29 '21 at 17:36
  • Let's separate the aspects of a) memory-inefficiency in not actually operating in-place under-the-hood b) fixing up the patchwork legacy mess in the API of `inplace` appearing in some commands but not others c) immutable(/FP) paradigm. And an alternative package that does most things in-place is [the Python port of datatable](https://datatable.readthedocs.io/en/latest/manual/comparison_with_pandas.html); at least it does in R. – smci Jan 09 '22 at 00:37
  • Related 2017 question: [Understanding inplace=True in pandas](https://stackoverflow.com/questions/43893457/understanding-inplace-true-in-pandas) – smci Jan 09 '22 at 00:45
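
The difference juanpa.arrivillaga points out in the comments above (rebinding a name versus mutating the object that other names also refer to) can be sketched as follows. This is only an illustration with made-up data; it says nothing about how many buffers get allocated under the hood, only about which names observe the change.

```python
import pandas as pd

# Case 1: reassignment. `alias` keeps pointing at the original object.
df = pd.DataFrame({"a": [3, 1, 2]})
alias = df
df = df.sort_values("a")             # `df` is rebound to a new, sorted frame

print(alias["a"].tolist())           # [3, 1, 2], the alias does not see the sort
print(df["a"].tolist())              # [1, 2, 3]

# Case 2: inplace=True. Every name bound to the object sees the change.
df2 = pd.DataFrame({"a": [3, 1, 2]})
alias2 = df2
df2.sort_values("a", inplace=True)   # the object itself is updated

print(alias2 is df2)                 # True, still one object
print(alias2["a"].tolist())          # [1, 2, 3], the alias sees the sort
```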

1 Answer

9

As the pandas documentation puts it, "... In general we like to favor immutability where sensible." The Pandas project is in the camp that prefers immutable (stateless) objects over mutable ones (objects with state), to guide programmers toward more scalable and parallelizable data processing code. It steers users in that direction by making `inplace=False` the default behavior.

In this Software Engineering Stack Exchange answer, Peter Torok discusses the pros and cons of mutable versus immutable objects really nicely: https://softwareengineering.stackexchange.com/a/151735

In summary, some software engineers feel that objects that are immutable (unchanging) lead to

  • fewer errors in the code - because the state of a mutable object is easy to lose track of, and the resulting bugs are hard to track down
  • increased scalability - it is easier to write multithreaded code, since one thread will not inadvertently modify the value contained by an object in another thread
  • more concise code - since code is forced to be written in a functional, more mathematical style

I will agree that this does have its inefficiencies, since constantly making copies of the same objects for minor changes does not seem ideal, but it has the other benefits noted above.
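
As a rough illustration of the "fewer errors" point, here is a small sketch contrasting a helper that mutates its argument with one that returns a new frame. Both helper names and the columns `x`, `y`, and `total` are hypothetical, chosen only for this example.

```python
import pandas as pd

def add_total_pure(frame):
    # Returns a new DataFrame; the caller's object is left as it was.
    return frame.assign(total=frame["x"] + frame["y"])

def add_total_mutating(frame):
    # Adds the column to the caller's object as a side effect.
    frame["total"] = frame["x"] + frame["y"]
    return frame

raw = pd.DataFrame({"x": [1, 2], "y": [10, 20]})

pure_result = add_total_pure(raw)
cols_after_pure = list(raw.columns)        # ['x', 'y']

mutated_result = add_total_mutating(raw)
cols_after_mutation = list(raw.columns)    # ['x', 'y', 'total']

print(cols_after_pure)
print(cols_after_mutation)
print(mutated_result is raw)               # True, same object handed back
```

With the pure version, any other code holding `raw` can keep assuming it only has the columns it started with; with the mutating version, that assumption silently stops being true after the call.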

lane