I'm coming from an R background where I didn't run into this issue.

Generally in the past I've made functions that act upon a dataframe and return some modified version of the dataframe. For example:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})

def multiply_function(dataset):
    # Adds a column to the object that was passed in
    dataset['output'] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
    return dataset

new_df = multiply_function(df)
new_df  # looks good!
df  # I would expect df to stay the same and not be updated with the new column

I'm trying to convert a good number of functions from R to Python. I'd like to avoid this issue, so that df is NOT modified outside the function because of what happens inside it.

This is particularly important when I'm re-running or modifying code, because it may not be valid to run the same dataframe through a function twice.

I have seen usage of

dataset = dataset.copy()

as the first line of the function, but is this really ideal? Is there a better way around this? My concern is that copying would substantially increase memory use when working with large datasets.
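
For reference, here is a minimal sketch of the two patterns in question. The function names `multiply_with_copy` and `multiply_with_assign` are hypothetical and only for illustration; `DataFrame.copy` and `DataFrame.assign` are existing pandas methods. Both variants return a new DataFrame object and leave `df` as it was. How much memory is actually duplicated depends on the pandas version and its copy semantics, which this sketch does not measure.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [6, 7, 8, 9, 10]})

def multiply_with_copy(dataset):
    # Hypothetical helper: work on an explicit copy so the caller's
    # dataframe is never modified.
    dataset = dataset.copy()
    dataset["output"] = dataset.iloc[:, 1] * dataset.iloc[:, 0]
    return dataset

def multiply_with_assign(dataset):
    # Hypothetical helper: assign() returns a new DataFrame with the
    # extra column and leaves the original object untouched.
    return dataset.assign(output=dataset.iloc[:, 1] * dataset.iloc[:, 0])

new_df_copy = multiply_with_copy(df)
new_df_assign = multiply_with_assign(df)
```

After running this, `df` still has only the columns `a` and `b`, and either function can be re-run on it any number of times.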

Thank you!

runningbirds
  • If you expect a new dataset then doing `.copy()` would create it, but isn't that what you are asking for? I'm not sure it would "blow up" any more than what you are wanting to do in the first place? – MyNameIsCaleb Sep 25 '19 at 17:54
  • [This](https://stackoverflow.com/questions/48173980/pandas-knowing-when-an-operation-affects-the-original-dataframe) is a good discussion on this topic. – MyNameIsCaleb Sep 25 '19 at 17:55
  • How do you imagine you can have two different dataframes without having... Two different dataframes? – juanpa.arrivillaga Sep 25 '19 at 18:21
  • Anyway, this doesn't really have to do with local vs global scope. You are using mutator methods on an object referenced by a local variable, so of course, these modifications will be seen by any other variables that reference that object – juanpa.arrivillaga Sep 25 '19 at 18:23
  • @MyNameIsCaleb I'm wondering if there is something else I am missing on how to code this? I'm used to a dataframe passed in as a parameter not being affected, since all the operations happen enclosed inside the function. I will read the link posted – runningbirds Sep 25 '19 at 18:38
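
To illustrate the point made in the comments above about mutation versus scope, here is a minimal sketch. The function names `add_column` and `rebind_only` are hypothetical, chosen only for this example. It shows that the parameter refers to the very object the caller passed in, so in-place modifications are visible outside, while rebinding the local name to a new object is not.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

def add_column(dataset):
    # Hypothetical helper: `dataset` is another name for the caller's
    # object, so this column assignment mutates that shared object.
    dataset["b"] = dataset["a"] * 2
    return dataset

def rebind_only(dataset):
    # Hypothetical helper: drop() returns a new DataFrame; rebinding the
    # local name to it does not change the object the caller holds.
    dataset = dataset.drop(columns="b")
    return dataset

result = add_column(df)     # same object as df, now with column "b"
rebound = rebind_only(df)   # a different object without column "b"
```

Here `result is df` evaluates to `True`, whereas `rebound` is a separate dataframe and `df` keeps its `b` column.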

0 Answers