I'm having an issue with pandas DataFrames. It seems that a copy of the DataFrame is created somewhere in my code, so my modifications are not applied to the original DataFrame.

In the code below, "update_df" still sees the DF with a "file_exists" column, which should have been removed by the previous function.

MAIN:

if __name__ == '__main__':
    df_main = load_df()
    clean_df2(df_main)
    update_df(df_main, image_path_main)
    .....

clean_df2:

def clean_df2(df): #remove non-existing files from DF
    df['file_exists'] = True # add column, set all to True?
    .....
    df = df[df['file_exists'] != False] #Keep only records that exist
    df.drop('file_exists', axis=1, inplace=True)  # delete the temporary column
    df.reset_index(drop=True, inplace = True)  # reindex if source has gaps

update_df:

def update_df(df, image_path): #add DF rows for files not yet in DF
    print(df)
    ....
Borisw37

1 Answer

I think when you do:

df = df[df['file_exists'] != False]

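You've created a new DataFrame and rebound the local name df to it. The object that df_main refers to in the caller is not filtered, and the drop and reset_index calls that follow act on the new local object, not on df_main. That is why df_main still has the "file_exists" column when update_df prints it.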
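A minimal sketch of that behaviour, assuming a toy DataFrame with a hypothetical 'path' column and hard-coded flags in place of the elided file check (rebind_only is an illustrative name, not part of the original code):

```python
import pandas as pd

def rebind_only(df):
    # Adding a column mutates the caller's DataFrame in place.
    df['file_exists'] = [True, False, True]  # hypothetical flags
    # Boolean indexing builds a new DataFrame; the assignment only
    # rebinds the local name `df`. The caller's object is untouched.
    df = df[df['file_exists']]
    # Everything from here on acts on the new local object.
    df = df.drop(columns='file_exists')

df_main = pd.DataFrame({'path': ['a.jpg', 'b.jpg', 'c.jpg']})
rebind_only(df_main)

print(df_main.columns.tolist())  # ['path', 'file_exists']
print(len(df_main))              # 3
```

The column added before the rebinding shows up in df_main, while the filtering and the column removal done after it do not.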

To make it work, you can change your function to:

def clean_df2(df): #remove non-existing files from DF
    df['file_exists'] = True # add column, set all to True?
    .....
    return df

And when you call clean_df2, assign the returned DataFrame back to the variable in the caller:

df_main = clean_df2(df_main)
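Putting the pieces together, here is a runnable sketch of the return-based version. The 'path' column and the existing_paths argument are assumptions standing in for the elided file check. A second variant, clean_df2_inplace (a hypothetical name), shows one way to keep the original call style by dropping rows on the caller's object instead of rebinding the name:

```python
import pandas as pd

def clean_df2(df, existing_paths):
    """Return a new DataFrame without rows for missing files."""
    # `existing_paths` is a hypothetical stand-in for the real file check.
    df['file_exists'] = df['path'].isin(existing_paths)
    df = df[df['file_exists']]            # new object, local name only
    df = df.drop(columns='file_exists')   # delete the temporary column
    return df.reset_index(drop=True)      # reindex if source has gaps

def clean_df2_inplace(df, existing_paths):
    """Remove rows for missing files from the caller's DataFrame."""
    missing = df.index[~df['path'].isin(existing_paths)]
    df.drop(index=missing, inplace=True)  # mutates the same object
    df.reset_index(drop=True, inplace=True)

existing = ['a.jpg', 'c.jpg']

# Return-based: the caller must rebind its own name.
df_main = pd.DataFrame({'path': ['a.jpg', 'b.jpg', 'c.jpg']})
df_main = clean_df2(df_main, existing)

# In-place: no assignment needed at the call site.
df_other = pd.DataFrame({'path': ['a.jpg', 'b.jpg', 'c.jpg']})
clean_df2_inplace(df_other, existing)
```

Returning the DataFrame is usually the easier pattern to reason about, since it does not depend on which pandas operations happen to modify the object in place.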
Allen Qin