I have a Python 3.x script that uses pandas and runs in a Jupyter notebook. The script makes heavy use of pandas DataFrames. Today I ran into erratic errors because multiple columns were being added to my DataFrame. While testing, I noticed that if I rerun the script in Jupyter, changes are made to the original DataFrame I returned, even though the DataFrame is used inside a function that returns None and even though it has been assigned to multiple variable names.

From what I have read, the variables might simply act as pointers back to the original DataFrame, meaning the original DataFrame is modified regardless of which variable I perform the operation on. It appears the only way around this is to copy the DataFrame when assigning it to a new variable. This is confusing, because I thought the operation would be local to whatever variables I use in the function, and because operations I perform on my DataFrames in other functions do not appear to modify the originals consistently. I set up the following test:

# dataframe passed in here; print it to verify accuracy (columns are ColumnA and ColumnB)
print(mydataframe)
# assign a new variable to the existing dataframe
test_the_dataframe1 = mydataframe
# assign a BACKUP of the dataframe to print in case the original is modified
MY_BACKUP_1 = mydataframe
# add a column to the new dataframe variable that subtracts ColumnB from ColumnA
test_the_dataframe1['Diff'] = test_the_dataframe1['ColumnA'] - test_the_dataframe1['ColumnB']
# print the supposedly untouched backup of the dataframe
print(MY_BACKUP_1)

MY_BACKUP_1 is modified by the operations performed on test_the_dataframe1. I would expect MY_BACKUP_1 to have only my original columns (ColumnA and ColumnB) and not the additional "Diff" column. If I run this three times, it adds three columns to every reference to the DataFrame. This is not consistent with other modifications I have made (unfortunately, I do not have specific examples at hand, but I believe the difference depends on which method I am using). Can someone please explain which operations alter my DataFrame through ALL variables and which do not (and stay local to the variable)? I am GUESSING the use case for different variables pointing back to one DataFrame might be to hold something like a sort or filter you need to reuse?
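For reference, here is a minimal sketch of the distinction being asked about, assuming a small stand-in DataFrame with the same ColumnA and ColumnB columns. The function names `add_diff_in_place` and `add_diff_rebind` are hypothetical and only illustrate the two cases. Plain assignment (`alias = original`) gives the same object a second name, column assignment with `df[...] = ...` mutates that object, and `DataFrame.assign` returns a new DataFrame instead.

```python
import pandas as pd


def add_diff_in_place(df):
    # Column assignment mutates the object the caller passed in,
    # so every name bound to that object sees the new column.
    df["Diff"] = df["ColumnA"] - df["ColumnB"]


def add_diff_rebind(df):
    # assign() returns a new DataFrame; rebinding the local name `df`
    # leaves the caller's object untouched.
    df = df.assign(Diff=df["ColumnA"] - df["ColumnB"])
    return df


# Stand-in data (an assumption, not the asker's real data).
original = pd.DataFrame({"ColumnA": [5, 7], "ColumnB": [1, 2]})
alias = original           # second name for the same object
backup = original.copy()   # independent copy

result = add_diff_rebind(original)
cols_after_rebind = list(original.columns)

add_diff_in_place(alias)
cols_after_in_place = list(original.columns)
backup_cols = list(backup.columns)

print(alias is original)
print(cols_after_rebind)
print(cols_after_in_place)
print(backup_cols)
```

In this sketch only the `.copy()` result stays independent of later in-place changes; `alias` and `original` are the same object, which matches what the test above shows for MY_BACKUP_1.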

Additionally, I am running this in Jupyter, so I am wondering whether some of the inconsistencies I am seeing are due to caching.

An answer matters for my use case because this script reads from a database whose server load should be kept down, so while testing I need to run the script as few times as possible.

Also, since I am new to Python and programming in general, would this same behavior be encountered on all mutable objects?
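To make the mutable-object part of the question concrete, here is a small sketch using a plain list, with no pandas involved. The function names `append_in_place` and `rebind_locally` are hypothetical. It shows the same pattern: an in-place method changes the object seen through every name, while building a new object and rebinding a local name does not.

```python
def append_in_place(items):
    # list.append mutates the list object the caller passed in.
    items.append(99)


def rebind_locally(items):
    # `items + [99]` builds a new list; the caller's list is unchanged.
    items = items + [99]
    return items


numbers = [1, 2, 3]
alias = numbers            # second name for the same list
snapshot = list(numbers)   # shallow copy, a separate list object

rebound = rebind_locally(numbers)
numbers_after_rebind = list(numbers)

append_in_place(alias)

print(numbers_after_rebind)
print(numbers)
print(snapshot)
print(rebound)
```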

ByRequest
  • https://www.programiz.com/python-programming/shallow-deep-copy – Jessica May 29 '19 at 21:42
  • So even if the operation is made within a variable in a function it's still just a reference and will alter the original database outside of the function? – ByRequest May 29 '19 at 23:42
