I have two DataFrames, train_df and test_df, and I store them in a list: combine = [train_df, test_df]. Both DataFrames have a column named "Gender" whose values are either "male" or "female". I want to modify that column in both DataFrames so that "male" is replaced with 0 and "female" with 1. I used the following code:

for dataset in combine:
    dataset["Gender"] = dataset["Gender"].map({"female": 1, "male": 0})

I noticed that this modified train_df and test_df, as well as both elements of combine. Why is that? I thought that dataset is a loop variable (so it holds just a local copy of a DataFrame) and that nothing would change (compare the question "Apply a for loop to multiple DataFrames in Pandas"). More generally, is it even appropriate to access DataFrame columns in a loop like this when there are multiple DataFrames? Is there a more Pythonic way?

  • A for loop is the easiest solution there is, and it is faster too. The loop variable holds a reference to the DataFrame, so changes made through it affect the DataFrame directly. Alternatively, you could concat the DataFrames, replace the values, and split them again (I prefer a for loop). – Bharath M Shetty Sep 20 '17 at 16:40
  • Assuming all your data is in one large dataframe before you split it, you could apply the map function you have above to that original dataframe just once, and then split your data? `data["Gender"] = data["Gender"].map({"female": 1, "male": 0})`. May not be exactly what you want but it could help. – PyRsquared Sep 20 '17 at 16:55
  • @Bharathshetty So how is this different from looping over a list of integers? In that case, whatever you do with the loop variable doesn't affect the list element (see https://stackoverflow.com/questions/19290762/cant-modify-list-elements-in-a-loop-python). Is there a difference in behaviour when looping over a collection of mutables vs. immutables? And thank you both for the splitting suggestion. – Jakub Łanecki Sep 20 '17 at 20:18
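
The distinction raised in the comments above can be sketched as follows. This is a minimal illustration, not a definitive answer: the two small DataFrames are made-up stand-ins for train_df and test_df, and the rename call is only there as an example of an operation that returns a new object. The point it tries to show is that the loop variable is another name for the same object, so item assignment mutates that object in place, whereas assigning a new value to the loop variable only rebinds the name.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train_df and test_df.
train_df = pd.DataFrame({"Gender": ["male", "female"]})
test_df = pd.DataFrame({"Gender": ["female", "male"]})
combine = [train_df, test_df]

# The list stores references, and the loop variable is just another
# name for the same object (no copy is made).
for dataset, original in zip(combine, (train_df, test_df)):
    assert dataset is original

# Item assignment mutates the object in place, so every name that
# refers to it (dataset, train_df, combine[0]) sees the change.
for dataset in combine:
    dataset["Gender"] = dataset["Gender"].map({"female": 1, "male": 0})

# Rebinding the loop variable only points the name at a new object;
# the list element is untouched. This is the integer case.
numbers = [1, 2, 3]
for n in numbers:
    n = n * 10  # new int bound to n; numbers is unchanged

# The same holds for DataFrames when the loop variable is rebound:
# rename() returns a new DataFrame, and train_df keeps its column name.
for dataset in combine:
    dataset = dataset.rename(columns={"Gender": "Sex"})

print(train_df["Gender"].tolist())
print(numbers)
print(list(train_df.columns))
```

Under these assumptions, the behaviour does not depend on whether the elements are mutable as such, but on whether the loop body mutates the object or rebinds the name; integers simply offer no in-place operation to mutate them with.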
