2

I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.

I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is specially difficult to understand.

For example: let's say I have a DataFrame (df) with some raw data. I want to use a function that receives df as a parameter and outputs a modified version of that df, but keeping the original df. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df as df.copy()?

In my case, I'm trying to write some custom function that transforms a df using a WoE encoder. The data parameter is a DataFrame with feature columns and a label column. It kinda looks like this:

def my_function(data, var_list, label_column):

    encoder = category_encoders.WOEEncoder(cols=var_list) # var_list = cols to be encoded

    fit_encoder = encoder.fit(
        X=data[var_list],
        y=data[label_column]
    )
    new_data = fit_encoder.transform(
        data[var_list]
    )
    new_data[label_column] = data[label_column]
    return new_data

So should I be passing data[var_list].copy() instead of data[var_list]? Should I assume that every function that receives a df will modify it in-place or will it return a different object? I mean, how can I be sure that fit_encoder.transform won't modify data itself? I also learned that Pandas sometimes produces views and sometimes not, depending on the operation you apply to the whatever subset of the df. So I feel like there's too much uncertainty surrounding operations on DataFrames.

eduardokapp
  • 1,612
  • 1
  • 6
  • 28
  • 1
    There's a **general** rule of thumb that a function or method that accepts an object and returns the same type of object modifies a copy of it. If it has no return value, it usually means the object is modified in-place. If you really want to be sure your object isn't modified, though, you can pass a copy. – sj95126 Sep 20 '21 at 20:13
  • 1
    " In Python, the whole concept of mutability and passing by reference (and not value) is kind of hard to grasp at first." **Python never passes anything by reference**. Nor by value. This is important to understand. [Read this answer to another question](https://stackoverflow.com/a/986145/5014455) for an accurate description of Python's evaluation strategy. – juanpa.arrivillaga Sep 20 '21 at 20:16
  • A very typical workflow with pandas is along the lines of `df = some_method(df)` in which case it doesn't matter because you want to modify the DataFrame. But if you need to do `df1 = some_method(df)` and you intend to use `df` again then you probably should do `df1 = some_method(df.copy())`. If your DataFrames are large, then inspect the methods to see if the copy is necessary, or if those functions already handle itand only `.copy` if it's necessary:https://stackoverflow.com/questions/62538804/how-is-pandas-dataframe-handled-across-multiple-custom-functions-when-passed-as – ALollz Sep 20 '21 at 20:17
  • Note, in general, most functions in the libraries you use shouldn't be modifying the dataframes in-place. Also, mutability works *exactly the same way* with regards to pandas objects, for your case, dataframes and series are *mutable* – juanpa.arrivillaga Sep 20 '21 at 20:17
  • 1
    "In general", "rule of thumb", so you're saying that yes, there is uncertainty. Is that it? – eduardokapp Sep 20 '21 at 20:19
  • Well, sure, it's always possible to find poorly documented code, or code that doesn't do what it says it should, or doesn't follow the usual conventions, but that can happen in any language. If you type ```help(numpy)``` there's a specific section on copies-vs-in-place operations. – sj95126 Sep 20 '21 at 20:30

1 Answers1

0

From the exercise shown on the website https://www.statology.org/pandas-copy-dataframe/ it shows that if you don't use .copy() when manipulating a subset of your dataframe, it could change values in your original dataframe as well. This is not what you want, so you should use .copy() when passing your dataframe to your function.

The example on the link I listed above really illustrates this concept well (and no I'm not affiliated with their site lol, I was just searching for this answer myself).

rpinto73
  • 41
  • 3