I'm more of an R user and have recently been "switching" to Python. So that means I'm way more used to the R way of dealing with things. In Python, the whole concept of mutability and passing by assignment is kind of hard to grasp at first.
I can easily understand the issues that mutability may lead to when using lists or dictionaries. However, when using pandas DataFrames, I find that mutability is specially difficult to understand.
For example: let's say I have a DataFrame (df
) with some raw data. I want to use a function that receives df
as a parameter and outputs a modified version of that df
, but keeping the original df
. If I wrote the function, maybe I can inspect it and be assured that it makes a copy of the input before applying any manipulation. However, if it's a function I don't know (let's say, from some package), should I always pass my input df
as df.copy()
?
In my case, I'm trying to write some custom function that transforms a df
using a WoE encoder. The data
parameter is a DataFrame with feature columns and a label column. It kinda looks like this:
def my_function(data, var_list, label_column):
encoder = category_encoders.WOEEncoder(cols=var_list) # var_list = cols to be encoded
fit_encoder = encoder.fit(
X=data[var_list],
y=data[label_column]
)
new_data = fit_encoder.transform(
data[var_list]
)
new_data[label_column] = data[label_column]
return new_data
So should I be passing data[var_list].copy()
instead of data[var_list]
? Should I assume that every function that receives a df
will modify it in-place or will it return a different object? I mean, how can I be sure that fit_encoder.transform
won't modify data
itself? I also learned that Pandas sometimes produces views and sometimes not, depending on the operation you apply to the whatever subset of the df
. So I feel like there's too much uncertainty surrounding operations on DataFrames.