I am trying to pass a data frame through a series of commands (preparing arguments for a function). However, when I assign a data frame to a new name, the assignment seems to act as an alias rather than a copy: after the assignment, every change made to the new data frame also applies to the original. What is a good way to preserve the original data frame in its original state, so that it can be reused as the starting point for other changes?

Please see below for an example.

# Merge several dataframes
import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'eTIV': [1.12, 2.22, 3.43, 5.43], })
df2 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Ear_Vol': [5, 6, 7, 8]})
df3 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Nose': [1, 2, 3, 5], })
df4 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Eye_Vol': [1, 2, 3, 5], })
df5 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Finger': [1.3, 2.123, 3.4, 5.5], })

dfs = [df1, df2, df3, df4,df5]

df_final = reduce(lambda left,right: pd.merge(left,right,on='ID'), dfs)

df_final

       ID  eTIV  Ear_Vol  Nose  Eye_Vol  Finger
0    Mary  1.12        5     1        1   1.300
1    Mike  2.22        6     2        2   2.123
2   Barry  3.43        7     3        3   3.400
3  Scotty  5.43        8     5        5   5.500

Assignment of the data frame to a different data frame and manipulations:

df = df_final
df_raw = df
df_raw.columns = df_raw.columns.str.replace(r"_Vol", "_Vol_Raw")
df_raw = pd.DataFrame(data = df_raw, columns= df_raw.columns)

New data frame (as expected):

df_raw
       ID  eTIV  Ear_Vol_Raw  Nose  Eye_Vol_Raw  Finger
0    Mary  1.12            5     1            1   1.300
1    Mike  2.22            6     2            2   2.123
2   Barry  3.43            7     3            3   3.400
3  Scotty  5.43            8     5            5   5.500

The original data frame is, for some reason, altered as well (why does assignment alter the original here?):

df

       ID  eTIV  Ear_Vol_Raw  Nose  Eye_Vol_Raw  Finger
0    Mary  1.12            5     1            1   1.300
1    Mike  2.22            6     2            2   2.123
2   Barry  3.43            7     3            3   3.400
3  Scotty  5.43            8     5            5   5.500
arkadiy
  • Assign with `.copy()`. As for why the original gets altered: [names refer to values](https://nedbatchelder.com/text/names.html) in Python. Assignment just gives two labels that both point to the same underlying dataframe. – Paritosh Singh Feb 13 '19 at 21:58
  • See the similar question [Strange behavior with DataFrame copy](https://stackoverflow.com/questions/50368386/strange-behavior-with-dataframe-copy). – Xukrao Feb 13 '19 at 22:09

2 Answers


If you wish to copy a dataframe and create a new object, use the `.copy()` method.

# Merge several dataframes
import pandas as pd
from functools import reduce
df1 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'eTIV': [1.12, 2.22, 3.43, 5.43], })
df2 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Ear_Vol': [5, 6, 7, 8]})
df3 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Nose': [1, 2, 3, 5], })
df4 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Eye_Vol': [1, 2, 3, 5], })
df5 = pd.DataFrame({'ID': ['Mary', 'Mike', 'Barry', 'Scotty'],'Finger': [1.3, 2.123, 3.4, 5.5], })

dfs = [df1, df2, df3, df4,df5]

df_final = reduce(lambda left,right: pd.merge(left,right,on='ID'), dfs)

df_final
df = df_final

print(df is df_final)  # Prints True: both names refer to the same DataFrame object.

df_raw = df.copy()  # Modified: create an independent copy instead of a second name.

print(df is df_raw)  # Prints False: copy() returned a new DataFrame object.
df_raw.columns = df_raw.columns.str.replace(r"_Vol", "_Vol_Raw")
df_raw = pd.DataFrame(data=df_raw, columns=df_raw.columns)
print(df_raw)
print(df)  # No longer affected by changes to df_raw.

Plain assignment behaves the way the question describes because names refer to values in Python. Assignment just creates two labels that point to the same underlying DataFrame object, so when that object is modified, every name bound to it reflects the change. For further reading, see [names refer to values](https://nedbatchelder.com/text/names.html).
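
To see the difference between a second name and a copy in isolation, here is a minimal sketch on a two-row frame. The variable names (`original`, `alias`, `independent`) and the sample values are made up for illustration and are not part of the question's data.

```python
import pandas as pd

original = pd.DataFrame({"ID": ["Mary", "Mike"], "Ear_Vol": [5, 6]})

alias = original               # a second name bound to the same object
independent = original.copy()  # a new DataFrame object, taken before any change

# Relabel the columns through the alias. Because `alias` and `original`
# are the same object, the change is visible through both names.
alias.columns = ["ID", "Ear_Vol_Raw"]

same_object = alias is original
alias_columns = list(original.columns)
copy_columns = list(independent.columns)

print(same_object)    # True
print(alias_columns)  # ['ID', 'Ear_Vol_Raw']
print(copy_columns)   # ['ID', 'Ear_Vol']
```

The copy keeps the old labels because it was created before the rename and is a separate object.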

Paritosh Singh

If you want to copy and rename the columns, you can use `rename` to do both in a single step. By default, the method returns a new DataFrame and leaves the original unchanged:

df_raw = df.rename(axis='columns', mapper=lambda s: s.replace(r"_Vol", "_Vol_Raw"))

print(df)
print(df_raw)

Output

       ID  eTIV  Ear_Vol  Nose  Eye_Vol  Finger
0    Mary  1.12        5     1        1   1.300
1    Mike  2.22        6     2        2   2.123
2   Barry  3.43        7     3        3   3.400
3  Scotty  5.43        8     5        5   5.500
       ID  eTIV  Ear_Vol_Raw  Nose  Eye_Vol_Raw  Finger
0    Mary  1.12            5     1            1   1.300
1    Mike  2.22            6     2            2   2.123
2   Barry  3.43            7     3            3   3.400
3  Scotty  5.43            8     5            5   5.500
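
Since the question asks about reusing one frame as the starting point for several changes, the same approach could be wrapped in a small helper. This is only a sketch: `with_volume_suffix` and the `"Norm"` suffix are hypothetical names chosen for illustration, and the sample frame is a shortened stand-in for the merged data.

```python
import pandas as pd


def with_volume_suffix(frame: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """Return a new frame with '_Vol' columns renamed; `frame` is not modified."""
    return frame.rename(columns=lambda name: name.replace("_Vol", f"_Vol_{suffix}"))


df = pd.DataFrame(
    {"ID": ["Mary", "Mike"], "eTIV": [1.12, 2.22], "Ear_Vol": [5, 6], "Eye_Vol": [1, 2]}
)

# Each call starts from the unchanged original.
df_raw = with_volume_suffix(df, "Raw")
df_norm = with_volume_suffix(df, "Norm")

print(list(df.columns))       # ['ID', 'eTIV', 'Ear_Vol', 'Eye_Vol']
print(list(df_raw.columns))   # ['ID', 'eTIV', 'Ear_Vol_Raw', 'Eye_Vol_Raw']
print(list(df_norm.columns))  # ['ID', 'eTIV', 'Ear_Vol_Norm', 'Eye_Vol_Norm']
```

Because the helper never assigns to `frame.columns`, the caller's frame keeps its labels no matter how many variants are derived from it.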
Dani Mesejo