I'm coming from an R background where I didn't run into this issue.
Generally in the past I've made functions that act upon a dataframe and return some modified version of the dataframe. For example:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})

    def multiply_function(dataset):
        # Adds the column to the object that was passed in, not to a copy
        dataset['output'] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
        return dataset

    new_df = multiply_function(df)
    new_df  # looks good!
    df      # I expected df to stay the same, but it now has the new column too

I'm converting a good number of functions from R to Python, and I'd like to avoid this issue so that df is NOT modified outside the function because of what happens inside it.
This is particularly important when I'm re-running or modifying code, because it may not be valid to run the same dataframe through a function twice.
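To make the re-run concern concrete, here is a minimal sketch. The function `double_a` is hypothetical and only for illustration (it is not part of my real code); it overwrites a column on the dataframe it receives, so calling it twice on the same dataframe compounds the change.

```python
import pandas as pd

def double_a(dataset):
    # Hypothetical example: overwrites column "a" on the caller's object
    dataset["a"] = dataset["a"] * 2
    return dataset

df = pd.DataFrame({"a": [1, 2, 3]})
first = double_a(df)
second = double_a(df)  # e.g. re-running the same cell or script section

print(first is df)       # the function handed back the very same object
print(df["a"].tolist())  # the values have been doubled twice
```

In R the equivalent function would leave the original untouched, which is why the second call is harmless there.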
I have seen

    dataset = dataset.copy()

used as the first line of the function, but is this really ideal? Is there a better way around this? I'm concerned that copying would blow up memory usage when working with large datasets.
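For reference, here is a rough sketch of two non-mutating variants of the function above. The names `multiply_with_copy` and `multiply_with_assign` are made up for this example. The first uses the explicit `copy()` pattern; the second uses `DataFrame.assign`, which returns a new dataframe rather than modifying its input.

```python
import pandas as pd

def multiply_with_copy(dataset):
    # Explicit copy: later assignments only touch the local copy
    dataset = dataset.copy()
    dataset["output"] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
    return dataset

def multiply_with_assign(dataset):
    # assign() builds and returns a new dataframe with the extra column
    return dataset.assign(output=dataset.iloc[:, 1] * dataset.iloc[:, 0])

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})
copied = multiply_with_copy(df)
assigned = multiply_with_assign(df)

print(list(df.columns))         # df keeps only its original columns
print(copied.equals(assigned))  # both variants produce the same result
```

Neither variant settles the memory question by itself. As far as I understand, whether the data is physically duplicated depends on the pandas version and its copy-on-write behaviour, which is part of what I'm asking about.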
Thank you!