
I'm trying to understand variable scope across a function call.

Here is the code in question:

import numpy as np
import pandas as pd

# Function to add a column with random stuff to a dataframe 
def Add_a_column(df):
    df['Col2']= np.sign(np.random.randn(len(df)))
    return df

# Create a dataframe with random stuff
df_full = pd.DataFrame(data=np.sign(np.random.randn(5)), columns=['Col1'])

df_another = Add_a_column(df_full)
  • df_full is global. Correct?
  • df_another is global. Correct?
  • df is local to Add_a_column. Correct?

When I execute the code, the column gets added to df_full as well:

In[8]: df_full
Out[8]: 
   Col1  Col2
0  -1.0  -1.0
1   1.0  -1.0
2  -1.0   1.0
3   1.0   1.0
4   1.0   1.0

How do I avoid df_full being modified by the function?

mapesd
  • The *name* `df` is local to the function, but `df` and `df_full` refer to the *same* object. – Daniel Roseman Dec 29 '17 at 19:19
  • sounds like you want to clone df_full in the function, manipulate the new object, and then send that back. – Fallenreaper Dec 29 '17 at 19:23
  • Expanding a bit what @DanielRoseman said, and without knowing anything about Pandas, I imagine you need to copy the `df_full` before passing it to the `Add_a_column` function? (see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html ) and read why this happens here: https://stackoverflow.com/q/2612802/289011 – Savir Dec 29 '17 at 19:23
  • @BorrajaX Or clone it inside the function. I'm not sure what the end goal is. – Fallenreaper Dec 29 '17 at 19:24
  • @BorrajaX You are correct but in pandas this might actually be a bit of a shock for the OP since a _lot_ of operations require `inplace=True` to actually take effect in such a way, so I can see where their confusion comes from :) – roganjosh Dec 29 '17 at 19:36
  • The OP should read https://nedbatchelder.com/text/names.html – chepner Dec 29 '17 at 19:44

2 Answers


A reference to df_full is passed into the function, so df and df_full are two names for the same object. Modifying that object through either name is visible through both.

You need to change your function to:

def Add_a_column(df):
    df = df.copy()
    df['Col2']= np.sign(np.random.randn(len(df)))
    return df

Alternatively, you could leave the function unchanged and pass it a copy of the DataFrame, like Add_a_column(df_full.copy()).
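
As a quick check of the copy-inside-the-function approach, here is a minimal sketch. The name `add_a_column_copy` is hypothetical, chosen only to keep it distinct from the original function. It assumes `DataFrame.copy()` with its default deep copy, so the returned frame owns its own data.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of Add_a_column that works on a copy
def add_a_column_copy(df):
    out = df.copy()  # new DataFrame object; the caller's frame is not touched
    out['Col2'] = np.sign(np.random.randn(len(out)))
    return out

df_full = pd.DataFrame(data=np.sign(np.random.randn(5)), columns=['Col1'])
df_another = add_a_column_copy(df_full)

print(list(df_full.columns))     # ['Col1']
print(list(df_another.columns))  # ['Col1', 'Col2']
```

Copying inside the function protects every caller; copying at the call site leaves the choice to the caller and avoids the extra copy when in-place modification is acceptable.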

rassar
  • df_full is global. Correct?
  • df_another is global. Correct?
  • df is local to Add_a_column. Correct?

It sounds like you understand scope just fine. Each variable has the scope you describe.

The piece you are missing is that df_full and df refer to the same object. When you make changes to that object through one variable, the changes are visible when you access the object through the other variable.
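
The difference between a local name and the object it refers to can be sketched as follows. The helper names `mutate` and `rebind` are hypothetical, used only for illustration: mutating the object through the local name is visible to the caller, while rebinding the local name to a new object is not.

```python
import pandas as pd

def mutate(df):
    # df is a local name, but it refers to the caller's object
    df['Col2'] = 0.0
    return df

def rebind(df):
    # Rebinding only changes what the local name points at
    df = pd.DataFrame({'Other': [1.0]})
    return df

df_full = pd.DataFrame({'Col1': [1.0, -1.0]})

returned = mutate(df_full)
print(returned is df_full)          # True: same object under two names
print('Col2' in df_full.columns)    # True: the mutation is visible outside

rebound = rebind(df_full)
print(rebound is df_full)           # False: a different object
print('Other' in df_full.columns)   # False: df_full is unaffected by rebinding
```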

Code-Apprentice