
Sorry if I am doing something stupid, but I am very puzzled by this issue: I pass a DataFrame to a function, and inside that function I add a column and then drop it. Nothing strange so far, but after the function has finished, the DataFrame in the global namespace still shows the column that was added and dropped. If I declare the DataFrame as global, this does not happen.

This test code reproduces the issue in all four combinations of Python 3.3.3/2.7.6 and pandas 0.13.0/0.12.0:

#!/usr/bin/python
import pandas as pd

# FUNCTION DFcorr
def DFcorr(df):
    # Calculate column of accumulated elements
    df['SUM']=df.sum(axis=1)
    print('DFcorr: DataFrame after add column:')
    print(df)
    # Drop column of accumulated elements
    df=df.drop('SUM',axis=1)
    print('DFcorr: DataFrame after drop column:')
    print(df)  

# FUNCTION globalDFcorr
def globalDFcorr():
    global C
    # Calculate column of accumulated elements
    C['SUM']=C.sum(axis=1)
    print('globalDFcorr: DataFrame after add column:')
    print(C)
    # Drop column of accumulated elements
    print('globalDFcorr: DataFrame after drop column:')
    C=C.drop('SUM',axis=1)
    print(C)  

######################### MAIN #############################
C = pd.DataFrame.from_items([('A', [1, 2]), ('B', [3 ,4])], orient='index', columns=['one', 'two'])
print('\nMAIN: Initial DataFrame:')
print(C)
DFcorr(C)
print('MAIN: DataFrame after call to DFcorr')
print(C)

C = pd.DataFrame.from_items([('A', [1, 2]), ('B', [3 ,4])], orient='index', columns=['one', 'two'])
print('\nMAIN: Initial DataFrame:')
print(C)
globalDFcorr()
print('MAIN: DataFrame after call to globalDFcorr')
print(C)

And here is the output:

MAIN: Initial DataFrame:
   one  two
A    1    2
B    3    4

[2 rows x 2 columns]
DFcorr: DataFrame after add column:
   one  two  SUM
A    1    2    3
B    3    4    7

[2 rows x 3 columns]
DFcorr: DataFrame after drop column:
   one  two
A    1    2
B    3    4

[2 rows x 2 columns]
MAIN: DataFrame after call to DFcorr
   one  two  SUM
A    1    2    3
B    3    4    7

[2 rows x 3 columns]

MAIN: Initial DataFrame:
   one  two
A    1    2
B    3    4

[2 rows x 2 columns]
globalDFcorr: DataFrame after add column:
   one  two  SUM
A    1    2    3
B    3    4    7

[2 rows x 3 columns]
globalDFcorr: DataFrame after drop column:
   one  two
A    1    2
B    3    4

[2 rows x 2 columns]
MAIN: DataFrame after call to globalDFcorr
   one  two
A    1    2
B    3    4

[2 rows x 2 columns]

What am I missing? Many thanks!

khyox

1 Answer


Note this line in DFcorr:

df=df.drop('SUM',axis=1)

The df.drop method returns a new DataFrame. It does not mutate the original df.

Inside DFcorr, df is just a local variable. Assignments to df do not affect the global variable C. Only mutations of df would affect C.

So, you could make DFcorr behave more like globalDFcorr by changing that line to:

df.drop('SUM',axis=1, inplace=True)
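
To see the difference between rebinding and mutating side by side, here is a minimal sketch. The function names `add_then_rebind` and `add_then_drop_inplace` and the helper `make_frame` are made up for illustration, and the frame is built with the plain `pd.DataFrame` constructor rather than `from_items`, on the assumption that a newer pandas may be installed.

```python
import pandas as pd

def add_then_rebind(df):
    # Item assignment mutates the object the caller passed in.
    df['SUM'] = df.sum(axis=1)
    # drop() returns a new DataFrame; this only rebinds the local name df.
    df = df.drop('SUM', axis=1)
    return df

def add_then_drop_inplace(df):
    df['SUM'] = df.sum(axis=1)
    # inplace=True mutates the caller's object instead of returning a new one.
    df.drop('SUM', axis=1, inplace=True)

def make_frame():
    # Hypothetical helper: same data as in the question.
    return pd.DataFrame({'one': [1, 3], 'two': [2, 4]}, index=['A', 'B'])

C = make_frame()
result = add_then_rebind(C)
cols_after_rebind = list(C.columns)    # caller's frame keeps the SUM column
result_cols = list(result.columns)     # the new frame does not have it
same_object = result is C              # False: drop() created a new object

D = make_frame()
add_then_drop_inplace(D)
cols_after_inplace = list(D.columns)   # SUM is gone from the caller's frame

print(cols_after_rebind, result_cols, same_object, cols_after_inplace)
```

If the function is meant to hand back a cleaned-up frame, returning the new DataFrame and assigning it at the call site (`C = add_then_rebind(C)`) is an alternative to `inplace=True`.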
unutbu
  • Thanks for the reply. Should I then understand that the DataFrame identifier in the function scope (`df` in this case) can arbitrarily refer to either the global variable or the local one? I mean, from the answer, should I understand that in `df['SUM']=df.sum(axis=1)` the `df` affects the global variable, while in `df=df.drop('SUM',axis=1)` and in `print(df)` the `df` refers to the local variable? – khyox Jan 30 '14 at 15:22
  • Here is my reasoning, so you can check whether I have understood your answer properly: if I think of `df` as a C/C++ pointer, at the beginning of the called function it points to the DataFrame in the global scope (here `C`), so `df['SUM']=df.sum(axis=1)` acts on `C`; but after `df=df.drop('SUM',axis=1)`, `df` points to a new DataFrame that is local to the function. Is this reasoning right? Many thanks. – khyox Jan 30 '14 at 15:38
  • @khyox: Yes, I think you have it! Here is an explanation of Python's [pass by assignment](http://stackoverflow.com/a/8140747/190597) function calling paradigm. – unutbu Jan 30 '14 at 19:04
  • Under the hood, Python translates `df['SUM'] = ...` into a call to `df.__setitem__('SUM', ...)` which mutates `df`. Thus `df` keeps pointing to the same object that `C` points to. But `df = ...` reassigns `df` to point to a new object, so further mutation of `df` would no longer affect `C`. – unutbu Jan 30 '14 at 19:23
  • Thank you very much for the link and your explanation! Both succinct and clear. – khyox Jan 30 '14 at 19:30
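
The `__setitem__` point from the comments is not specific to pandas. As a rough sketch, the same behavior can be reproduced with a plain `dict`; the function names `mutate` and `rebind` are hypothetical and only serve the illustration.

```python
def mutate(d):
    # Equivalent to d['SUM'] = 3: a method call on the shared object.
    d.__setitem__('SUM', 3)
    return d

def rebind(d):
    # Builds a new dict and rebinds the local name; the caller's dict is untouched.
    d = {k: v for k, v in d.items() if k != 'SUM'}
    return d

outer = {'one': 1, 'two': 2}
m = mutate(outer)   # m and outer are the same object
r = rebind(outer)   # r is a different object without 'SUM'

print(m is outer, r is outer, outer, r)
```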