I'm using Python 3.6 and Pandas 0.20.3.
I'm sure this must be addressed somewhere, but I can't seem to find it. Inside a function, I alter a dataframe by adding a column, then restore it to the original columns by reassigning the name. I don't return the dataframe. After the call, the added column is still there. I could understand it if neither change were visible outside the function, or if both were. What confuses me is that the added column persists while the reassignment does not. Here is the code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 5))
df
which gives
0 1 2 3 4
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194
1 0.732978 -0.079787 -0.051720 1.097441 0.089850
2 1.859737 -1.422845 -1.148805 0.254504 1.207134
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972
6 1.284812 0.843705 -0.885566 1.087703 -1.006714
7 0.135243 0.055807 -1.217794 0.018104 -1.571214
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418
Now, I create a function:
def mess_around(df):
    cols = df.columns
    df['extra'] = 'hi'
    df = df[cols]
then run it and display the dataframe:
mess_around(df)
df
which gives:
0 1 2 3 4 extra
0 0.406779 -0.481733 -1.187696 -0.210456 -0.608194 hi
1 0.732978 -0.079787 -0.051720 1.097441 0.089850 hi
2 1.859737 -1.422845 -1.148805 0.254504 1.207134 hi
3 0.074400 -1.352875 -1.341630 -1.371050 0.005505 hi
4 -0.102024 -0.905506 -0.165681 2.424180 0.761963 hi
5 0.400507 -0.069214 0.228971 -0.079805 -1.059972 hi
6 1.284812 0.843705 -0.885566 1.087703 -1.006714 hi
7 0.135243 0.055807 -1.217794 0.018104 -1.571214 hi
8 -0.524320 -0.201561 1.535369 -0.840925 0.215584 hi
9 -0.495721 0.284237 0.235668 -1.412262 -0.002418 hi
I know I can solve the problem by returning df from the function, so I can fix it. I want to understand where I am going wrong. I suspect that the name df inside the function is local: it receives a reference to the dataframe, and reassigning that local name does not change what the caller's name refers to. Yet the column assignment operates through the reference that was passed in and therefore modifies the caller's dataframe directly. Is that correct?
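To make the suspicion above concrete, here is a minimal sketch that compares object identities before and after each step. The helper name `mutate_then_rebind` and the variable names are hypothetical, introduced only for illustration; the function body mirrors `mess_around` but returns the ids so they can be inspected.

```python
import numpy as np
import pandas as pd


def mutate_then_rebind(frame):
    """Hypothetical helper mirroring mess_around; returns ids for inspection."""
    id_on_entry = id(frame)   # identity of the object the caller passed in
    cols = frame.columns      # column labels captured before the mutation
    frame['extra'] = 'hi'     # in-place mutation of the caller's object
    frame = frame[cols]       # rebinds the local name to a new DataFrame
    return id_on_entry, id(frame)


df = pd.DataFrame(np.random.randn(10, 5))
id_before = id(df)
id_on_entry, id_after_rebind = mutate_then_rebind(df)

print(id_on_entry == id_before)      # did the function receive the caller's object?
print(id_after_rebind == id_before)  # does the rebound local name still refer to it?
print('extra' in df.columns)         # did the mutation reach the caller's DataFrame?
```

If the suspicion is right, the column assignment acts on the shared object, while `frame = frame[cols]` only changes what the local name refers to.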
EDIT: For those that might want to address the dataframe in place, I've added:
for c in df.columns:
    if c not in cols:
        del df[c]
I'm guessing if I return the new dataframe, then there will be a potentially large dataframe that will have to be dealt with by garbage collection.
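For comparison, a possible variant of the in-place cleanup, sketched under the assumption that the extra columns are only needed temporarily. The function name `mess_around_in_place` is hypothetical. It uses `DataFrame.drop` with `axis=1` and `inplace=True` instead of a `del` loop, and it does not measure whether this saves memory compared with returning a new dataframe.

```python
import numpy as np
import pandas as pd


def mess_around_in_place(frame):
    """Hypothetical variant: add a scratch column, then remove it in place."""
    original_cols = list(frame.columns)  # snapshot of labels before adding anything
    frame['extra'] = 'hi'                # temporary column on the caller's object
    added = [c for c in frame.columns if c not in original_cols]
    frame.drop(added, axis=1, inplace=True)  # mutate in place, no rebinding


df = pd.DataFrame(np.random.randn(10, 5))
mess_around_in_place(df)
print(list(df.columns))
```

Because the local name is never rebound, every change goes through the object the caller passed in, so nothing needs to be returned.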