It is probably very common to use the inplace argument in pandas functions when manipulating dataframes.

inplace is an argument accepted by many functions, such as set_index(), dropna(), fillna(), reset_index(), drop(), replace(), and many more. Its default value is False, in which case the function returns a modified copy of the object.

I want to know in detail when it is good practice to use inplace in pandas functions, when you shouldn't, and the reasons why. Can you demonstrate with examples that can serve as a reference, since this issue comes up so often when using pandas functions?

For example:

df.drop(columns=[your_columns], inplace=True)

In which cases is using inplace with drop recommended? Also, if some variables, such as a list, depend on the dataframe, changing it in place will affect the other variables that depend on it. Another issue is that using inplace prevents method chaining on a pandas dataframe.

Oghli

2 Answers

I will give examples, but the better way to look at it is as a question of style.

First, for any production code or code which can be run in any arbitrarily parallelized way, you don't want to change anything in place.

There is one major philosophical difference between functional programming and object-oriented programming: functional programming has no side effects. What does that mean? It means that if I have a function or method, let's take df.drop() for a tangible example, then using drop in a purely functional fashion will only return a result. It will do nothing else.

Let's make a dataframe:

>>> import pandas as pd
>>> df = pd.DataFrame({"name": ["Alice", "Bob", "Candi"],
...                    "job": ["CFO", "Accountant", "Developer"],
...                    "department": ["Executive", "Accounting", "Product"]})
>>> df
    name         job  department
0  Alice         CFO   Executive
1    Bob  Accountant  Accounting
2  Candi   Developer     Product

No Side Effects (inplace=False)

Now, if I call drop in a functional way, all that happens is that a new dataframe is returned with the column(s) removed:

>>> df.drop(columns = "job", inplace=False)
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product

Notice that I am returning the result, which is the dataframe. To be clear, I can do this:

>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product

Notice that new_df has been assigned to the returned result of the drop method.

With Side Effects (inplace=True)

>>> df.drop(columns="job", inplace=True)
>>> 

Notice that nothing is returned! The return of this method, in fact, is None.

But something did happen:

>>> df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product

If I ask for the dataframe, we can see that df was in fact changed so that the job column is missing. But this entirely happened as a side effect, not as a return.

To prove that nothing is being returned, let's try this again (with a different column, for reasons mentioned below) and assign the result of the inplace method to a new variable:

>>> not_much = df.drop(columns="name", inplace=True)
>>> type(not_much)
<class 'NoneType'>

As you can see, the variable not_much is of NoneType, which means that None was returned.
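A common pitfall follows directly from that None return value. The snippet below is a minimal sketch, using a small made-up dataframe and the hypothetical variable names `result` and `lost`, of what happens when the return value of an inplace call is assigned back to a variable:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "job": ["CFO", "Accountant"]})

# inplace=True mutates df and returns None.
result = df.drop(columns="job", inplace=True)
print(result is None)    # True
print(list(df.columns))  # ['name']

# Pitfall: rebinding a name to the return value of an inplace call
# throws the dataframe away and leaves None behind.
lost = pd.DataFrame({"name": ["Alice", "Bob"], "job": ["CFO", "Accountant"]})
lost = lost.drop(columns="job", inplace=True)
print(lost)              # None
```

Mixing the two styles (assignment plus inplace=True) is usually the source of "my dataframe became None" bugs, so it helps to pick one style per statement.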

Philosophy - Or, "when to use or not use"

Software engineering has changed over the years, and parallel activity is a much more common thing now. If you run big data jobs on Spark, or even if you run pandas on your single laptop, you can configure tasks to run multi-threaded, on multiple processes, asynchronously, as map-reduce jobs, etc.

Because of these parallel actions, you often won't know what will happen first and what will happen second. You want as many actions as possible either not to change state, or to change state in an atomic fashion.

Now let's revisit the df.drop, repeating it multiple times. Imagine you have a big-data job that is network-limited -- often the case -- and you just ask 10 machines to do the same task, and you accept the answer from whichever machine returns the answer first. This is a common way to deal with network inconsistencies.

Inplace:

>>> df
    name         job  department
0  Alice         CFO   Executive
1    Bob  Accountant  Accounting
2  Candi   Developer     Product
>>> df.drop(columns="job", inplace=True)
>>> df.drop(columns="job", inplace=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: "['job'] not found in axis"
>>> 

I just ran the same job twice and got different outcomes, one of which was an error. That is not good when jobs run in parallel.
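The same problem shows up without any parallelism as soon as a function mutates the dataframe it was handed. This is only a sketch: `drop_job_and_count` is a hypothetical helper invented for illustration, not part of pandas.

```python
import pandas as pd

def drop_job_and_count(frame):
    # Hypothetical helper: the in-place drop changes the caller's object too.
    frame.drop(columns="job", inplace=True)
    return len(frame)

df = pd.DataFrame({"name": ["Alice", "Bob", "Candi"],
                   "job": ["CFO", "Accountant", "Developer"],
                   "department": ["Executive", "Accounting", "Product"]})

drop_job_and_count(df)
print(list(df.columns))  # ['name', 'department'] -- the caller's df changed

# Calling the helper a second time fails, because "job" is already gone.
second_call_failed = False
try:
    drop_job_and_count(df)
except KeyError:
    second_call_failed = True
print(second_call_failed)  # True
```

The caller never asked for df to change, yet it did, and the helper is no longer safe to call twice.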

Not Inplace:

>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product

No matter how often the code above is run, new_df will always hold the same result.
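That repeatability can be checked directly. The following is a small sketch (the names `first` and `second` are just illustrative) that runs the non-inplace drop twice and compares the results with DataFrame.equals:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob", "Candi"],
                   "job": ["CFO", "Accountant", "Developer"],
                   "department": ["Executive", "Accounting", "Product"]})

# Without inplace, each call builds a new dataframe from the untouched original.
first = df.drop(columns="job")
second = df.drop(columns="job")

print(first.equals(second))  # True
print(list(df.columns))      # ['name', 'job', 'department']
```

Because df itself is never modified, the order and number of calls do not matter.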

When to use one versus the other

I would never use inplace=True unless it was in a one-off Jupyter notebook, a homework assignment, or something else equally far from a production environment.
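Leaving inplace at its default also keeps method chaining available, which the question raised as a concern. Here is a minimal sketch, assuming a made-up `raw` dataframe with one missing name; every step returns a new dataframe, so the steps compose into a single expression:

```python
import pandas as pd

raw = pd.DataFrame({"name": ["Alice", "Bob", None],
                    "job": ["CFO", "Accountant", "Developer"],
                    "department": ["Executive", "Accounting", "Product"]})

# Each method returns a new dataframe, so the calls can be chained.
clean = (
    raw
    .dropna(subset=["name"])                 # drop rows with no name
    .drop(columns="job")                     # remove the job column
    .rename(columns={"department": "dept"})  # shorten a column name
    .reset_index(drop=True)                  # renumber the remaining rows
)

print(list(clean.columns))  # ['name', 'dept']
print(len(clean))           # 2
print(len(raw))             # 3 -- raw is untouched
```

With inplace=True each of these calls would return None, so the chain would break at the second step.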

Final Note

I started off saying something about "functional vs object-oriented". That is how it is framed, but I don't like that comparison because it starts flame wars, and object oriented does not have to have side effects.

It's just that functional cannot have side effects, and object-oriented often does.

I prefer to say "side effects vs no side effects". That choice is easy: always prevent side effects whenever possible, recognizing it is not always possible (despite what Haskell suggests).

Mike Williamson
  • Very detailed and informative answer. Thank you Mike for providing this great explanation for inplace use cases in Pandas. – Oghli Dec 12 '22 at 06:49
  • You're very welcome! This is all stuff which was not at all obvious to me when I started out... – Mike Williamson Dec 12 '22 at 08:48
The inplace argument is used in many Pandas functions to specify whether the function should modify the original dataframe or return a new dataframe with the modifications applied. By default, the inplace argument is set to False, which means that the function will return a new dataframe.

In most cases, it is good practice to use the inplace argument when you want to modify a dataframe in place, rather than creating a new dataframe. This can save memory and improve performance, especially when working with large datasets. For example, you can use the inplace argument with the drop() function to remove rows or columns from a dataframe in place, like this:

df.drop(columns=["column1", "column2"], inplace=True)

In some cases, however, it is better not to use the inplace argument. For example, if you want to keep a copy of the original dataframe before applying any modifications, you should not use the inplace argument. This way, you can use the original dataframe for reference or comparison purposes. In this case, you can simply omit the inplace argument, like this:

df2 = df.drop(columns=["column1", "column2"])

Another case where you should avoid using the inplace argument is when you are not sure whether the function will modify the dataframe in the way you expect. For example, the fillna() function can replace missing values with a specified value, but it may not always produce the desired result. In this case, it is better to first create a copy of the dataframe using the copy() method, and then apply the fillna() function to the copy, like this:

df2 = df.copy()
df2.fillna(value=0, inplace=True)
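As a quick check of what the copy buys you, the sketch below uses a made-up single-column dataframe (the `score` column and the name `df3` are assumptions for illustration). It also shows that calling fillna() without inplace gives the same result while leaving the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"score": [1.0, float("nan"), 3.0]})

# Copy first, then modify the copy in place.
df2 = df.copy()
df2.fillna(value=0, inplace=True)

# Equivalent result without inplace: fillna returns a new dataframe.
df3 = df.fillna(value=0)

print(int(df["score"].isna().sum()))  # 1 -- the original still has its NaN
print(df2.equals(df3))                # True
```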

Overall, the inplace argument is a useful tool for modifying dataframes in place, but it should be used with care to avoid unintended consequences. It is always a good idea to create a copy of the dataframe before applying any modifications, and to carefully test the results to ensure that they are correct.

4pi
  • `inplace`, for the moment, [does not save memory](https://github.com/pandas-dev/pandas/issues/16529). (Note that the linked issue also mentions it will be, or at least may be, deprecated in the future.) But there are [some good tips](https://marcobonzanini.com/2021/09/15/tips-for-saving-memory-with-pandas/) for saving memory. I would argue that if you're worried about memory, and it isn't just a homework assignment, use [Spark](https://spark.apache.org/docs/latest/api/python/) instead. – Mike Williamson Dec 21 '22 at 20:04