
I'm using Python 3.6 and Pandas 0.20.3.

I'm sure this must be addressed somewhere, but I can't seem to find it. I alter a dataframe inside a function by adding columns, then restore the dataframe to its original columns by reassigning it. I don't return the dataframe. The added columns stay. I could understand it if neither change persisted outside the function: the added columns vanish and the reassignment has no effect. I'd also understand it if both persisted: the added columns stay and the reassignment sticks. Instead I get one but not the other. Here is the code:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 5))
df

which gives

    0           1           2           3           4
0   0.406779    -0.481733   -1.187696   -0.210456   -0.608194
1   0.732978    -0.079787   -0.051720   1.097441    0.089850
2   1.859737    -1.422845   -1.148805   0.254504    1.207134
3   0.074400    -1.352875   -1.341630   -1.371050   0.005505
4   -0.102024   -0.905506   -0.165681   2.424180    0.761963
5   0.400507    -0.069214   0.228971    -0.079805   -1.059972
6   1.284812    0.843705    -0.885566   1.087703    -1.006714
7   0.135243    0.055807    -1.217794   0.018104    -1.571214
8   -0.524320   -0.201561   1.535369    -0.840925   0.215584
9   -0.495721   0.284237    0.235668    -1.412262   -0.002418

Now, I create a function:

def mess_around(df):
    cols = df.columns
    df['extra']='hi'
    df = df[cols]

then run it and display the dataframe:

mess_around(df)
df

which gives:

    0           1           2           3           4           extra
0   0.406779    -0.481733   -1.187696   -0.210456   -0.608194   hi
1   0.732978    -0.079787   -0.051720   1.097441    0.089850    hi
2   1.859737    -1.422845   -1.148805   0.254504    1.207134    hi
3   0.074400    -1.352875   -1.341630   -1.371050   0.005505    hi
4   -0.102024   -0.905506   -0.165681   2.424180    0.761963    hi
5   0.400507    -0.069214   0.228971    -0.079805   -1.059972   hi
6   1.284812    0.843705    -0.885566   1.087703    -1.006714   hi
7   0.135243    0.055807    -1.217794   0.018104    -1.571214   hi
8   -0.524320   -0.201561   1.535369    -0.840925   0.215584    hi
9   -0.495721   0.284237    0.235668    -1.412262   -0.002418   hi

I know I can solve the problem by returning df, so I can fix it. I want to understand where I am going wrong. I suspect that the variable df inside the function is local in scope: it is given a pointer to the dataframe, and reassigning it only changes the local variable, not the caller's. Yet the column assignment uses the pointer that was passed in and therefore changes the dataframe "directly". Is that correct?

EDIT: For those that might want to address the dataframe in place, I've added:

for c in df.columns:
    if c not in cols:
        del df[c]

I'm guessing if I return the new dataframe, then there will be a potentially large dataframe that will have to be dealt with by garbage collection.


1 Answer


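To understand what happens, you should know the difference between passing arguments to functions by value versus passing them by reference. Python passes a reference to the object, so the parameter inside the function and the variable in the caller refer to the same dataframe.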


You pass a variable df to your function mess_around. The function modifies the original dataframe in place by adding a column.

This subsequent line of code seems to be the cause for confusion here:

df = df[cols]

What happens here is that the variable df originally held a reference to your dataframe. The reassignment makes the local variable point to a different object (the new dataframe returned by df[cols]); your original dataframe is not changed by this line, so it still has the added column.
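One way to check this is to compare object identity after each statement. This is only a minimal sketch: the function name show_rebinding and its returned tuple are made up for illustration and are not part of the question's code.

```python
import numpy as np
import pandas as pd


def show_rebinding(df):
    """Hypothetical helper: report whether each step keeps the same object."""
    original = df                  # second name for the caller's dataframe
    cols = df.columns              # columns captured before 'extra' is added
    df['extra'] = 'hi'             # mutation: same object, one more column
    same_after_mutation = df is original
    df = df[cols]                  # rebinding: the local name now refers to a new dataframe
    same_after_rebinding = df is original
    return same_after_mutation, same_after_rebinding


frame = pd.DataFrame(np.random.randn(3, 2))
print(show_rebinding(frame))       # (True, False)
print('extra' in frame.columns)    # True: the caller still sees the mutation
```

The first flag stays True because adding a column changes the existing object; the second is False because df[cols] builds a new dataframe that only the local name refers to.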

Here's a simpler example:

import numpy as np

def foo(l):
    l.insert(0, np.nan)   # mutation - the original list is modified
    l = [4, 5, 6]         # reassignment - no change to the original,
                          # but the local variable l now points to something different

lst = [1, 2, 3]
foo(lst)

print(lst)                # [nan, 1, 2, 3]
                          # the insert modified the original; the reassignment did not
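Applied to the dataframe from the question, here is a hedged sketch of two ways to get the intended result: drop the added columns in place, or return the new dataframe and let the caller rebind it. The function names restore_in_place and restore_by_return are hypothetical, chosen only for this illustration.

```python
import numpy as np
import pandas as pd


def restore_in_place(df):
    """Hypothetical variant: mutate the caller's dataframe, never rebind df."""
    cols = list(df.columns)
    df['extra'] = 'hi'
    added = [c for c in df.columns if c not in cols]
    df.drop(added, axis=1, inplace=True)   # modifies the object the caller holds


def restore_by_return(df):
    """Hypothetical variant: build a new dataframe and hand it back."""
    cols = list(df.columns)
    df['extra'] = 'hi'
    return df[cols]


frame = pd.DataFrame(np.random.randn(10, 5))
restore_in_place(frame)
print(list(frame.columns))                 # [0, 1, 2, 3, 4]

other = pd.DataFrame(np.random.randn(10, 5))
result = restore_by_return(other)
print(list(result.columns))                # [0, 1, 2, 3, 4]
print('extra' in other.columns)            # True: the caller's original was still mutated
```

Note that the returning variant still leaves the extra column on the dataframe that was passed in, so the caller would normally write other = restore_by_return(other). Once nothing refers to the old dataframe any more, it becomes eligible for garbage collection.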