I will give examples, but the better way to look at this is as a question of style.
First, in production code, or in any code that may be run in an arbitrarily parallelized way, you don't want to change anything in place.
There is one major philosophical difference between functional programming and object-oriented programming: purely functional code has no side effects. What does that mean? It means that if I have a function or method (let's take df.drop() as a tangible example), then using drop in a purely functional fashion will only return a result; it will do nothing else.
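Before turning to pandas, the same distinction can be sketched with plain dictionaries. The function names and the `employee` record below are made up for illustration and are not part of any library; the point is only that one function returns a new value while the other changes its argument and returns nothing.

```python
def without_key_pure(record, key):
    """Return a new dict without `key`; `record` itself is left untouched."""
    return {k: v for k, v in record.items() if k != key}


def remove_key_impure(record, key):
    """Delete `key` from `record` itself; the change is a side effect."""
    del record[key]  # no return statement, so the caller gets None


employee = {"name": "Alice", "job": "CFO"}

# No side effects: the result comes back as a return value.
trimmed = without_key_pure(employee, "job")
snapshot = dict(employee)  # copy taken to show the original is unchanged

# Side effect: nothing useful is returned, but `employee` is modified.
result = remove_key_impure(employee, "job")

print(trimmed)
print(snapshot)
print(result, employee)
```

The first call can be repeated any number of times with the same outcome. The second raises a KeyError if it is repeated, because the key is already gone.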
Let's make a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({"name": ["Alice", "Bob", "Candi"],
...                    "job": ["CFO", "Accountant", "Developer"],
...                    "department": ["Executive", "Accounting", "Product"]})
>>> df
    name         job  department
0  Alice         CFO   Executive
1    Bob  Accountant  Accounting
2  Candi   Developer     Product
No Side Effects (inplace=False)
Now, if I call drop in a functional way, all that happens is that a new dataframe is returned without the dropped column(s):
>>> df.drop(columns="job", inplace=False)
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
Notice that the result, a new dataframe, is returned. To be clear, I can do this:
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
Notice that new_df has been assigned the result returned by the drop method.
With Side Effects (inplace=True)
>>> df.drop(columns="job", inplace=True)
>>>
Notice that nothing is returned! The return value of this method is, in fact, None.
But something did happen:
>>> df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
If I ask for the dataframe, we can see that df was in fact changed so that the job column is missing. But this happened entirely as a side effect, not as a return value.
To prove that nothing is being returned, let's try this again (with a different column, for reasons mentioned below) and assign the result of the inplace method to a new variable:
>>> not_much = df.drop(columns="name", inplace=True)
>>> type(not_much)
<class 'NoneType'>
As you can see, the variable not_much is of NoneType, which means that None was returned.
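One practical consequence of that None, sketched here under the assumption that drop behaves as shown above (which is how pandas behaves at the time of writing): rebinding a variable to the result of an inplace call throws the dataframe away. The variable names `lost` and `kept` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "job": ["CFO", "Accountant"]})

# Pitfall: the inplace call returns None, so this rebinding loses the data.
lost = df.copy()
lost = lost.drop(columns="job", inplace=True)
print(lost)

# Without inplace, the returned dataframe is the result, and df is untouched.
kept = df.drop(columns="job")
print(list(kept.columns))
print(list(df.columns))
```

Mixing the two styles on one line is the usual source of this bug. Pick one: either assign the returned value, or mutate and do not assign.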
Philosophy - Or, "when to use or not use"
Software engineering has changed over the years, and parallel activity is a much more common thing now. If you run big data jobs on Spark, or even if you run pandas on your single laptop, you can configure tasks to run multi-threaded, on multiple processes, asynchronously, as map-reduce jobs, etc.
Because of this parallelism, you often won't know what will happen first and what will happen second. You want as many actions as possible either not to change state at all, or to change state atomically.
Now let's revisit df.drop, repeating it multiple times. Imagine you have a big-data job that is network-limited (often the case), so you ask 10 machines to do the same task and accept the answer from whichever machine returns first. This is a common way to deal with network inconsistencies.
Inplace:
>>> df
    name         job  department
0  Alice         CFO   Executive
1    Bob  Accountant  Accounting
2  Candi   Developer     Product
>>> df.drop(columns="job", inplace=True)
>>> df.drop(columns="job", inplace=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    raise KeyError(f"{list(labels[mask])} not found in axis")
KeyError: "['job'] not found in axis"
>>>
I just ran the same job twice and got two different outcomes, one of which was an error. That is not good when the same job may be run in parallel or repeated.
Not Inplace:
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
>>> new_df = df.drop(columns="job", inplace=False)
>>> new_df
    name  department
0  Alice   Executive
1    Bob  Accounting
2  Candi     Product
No matter how often the code above is run, new_df will always equal the same thing.
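To tie this back to the ten-machines scenario, here is a rough single-machine sketch. It assumes threads are an acceptable stand-in for the ten machines, and the helper name `drop_job` is made up for the example. Because the task has no side effects, it does not matter which copy finishes first or how many copies run.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Candi"],
    "job": ["CFO", "Accountant", "Developer"],
    "department": ["Executive", "Accounting", "Product"],
})


def drop_job(frame):
    # No side effects: returns a new dataframe and leaves `frame` alone.
    return frame.drop(columns="job")


# Submit the same task ten times and take whichever copy finishes first.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(drop_job, df) for _ in range(10)]
    done, _not_done = wait(futures, return_when=FIRST_COMPLETED)
    first_result = next(iter(done)).result()
    # Collect every result as well, only to show that they all agree.
    all_results = [future.result() for future in futures]

print(first_result)
print(all(result.equals(first_result) for result in all_results))
```

Swapping in an inplace drop on the shared df would make the outcome depend on ordering: one task would succeed and the rest would hit the KeyError shown earlier.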
When to use one versus the other
I would never use inplace=True unless it was in a one-off Jupyter notebook, a homework assignment, or something else equally far from a production environment.
Final Note
I started off saying something about "functional vs object-oriented". That is how it is usually framed, but I don't like that comparison because it starts flame wars, and object-oriented code does not have to have side effects.
It's just that purely functional code cannot have side effects, and object-oriented code often does.
I prefer to say "side effects vs no side effects". That choice is easy: always prevent side effects whenever possible, recognizing it is not always possible (despite what Haskell suggests).