I aggregate data from a lot of sources and have ended up with a large number of duplicate columns that are virtually identical, except that one of each pair is of type int while the other is of type float.

Is there a way to automatically keep only the first instance of each column that shares its name with another?

I was pointed to this other question and I could use

data.columns[data.columns.duplicated(keep=False)].tolist()

to get the names of the duplicated columns.
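For reference, here is a minimal sketch of what that expression returns, assuming a small made-up frame (the column names a and b and the variable names all_dupes and later_dupes are hypothetical, not my real data):

```python
import pandas as pd

# Hypothetical frame: the name "a" appears twice, "b" once
data = pd.DataFrame([[1, 1.0, 5], [2, 2.0, 6]], columns=["a", "a", "b"])

# keep=False flags every occurrence of a repeated name
all_dupes = data.columns[data.columns.duplicated(keep=False)].tolist()

# the default keep="first" flags only the later occurrences
later_dupes = data.columns[data.columns.duplicated()].tolist()

print(all_dupes)
print(later_dupes)
```

With keep=False each repeated name is listed once per occurrence, which is why the loop that follows can count occurrences.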

Continuing to adapt that answer to my problem, I ended up with this:

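renamer = {}  # maps each duplicated name to the new names for its occurrences, in order
for column_name in data.columns[data.columns.duplicated(keep=False)].tolist():
    if column_name not in renamer:
        # first occurrence keeps its original name
        renamer[column_name] = [column_name]
    else:
        # later occurrences get a '_to_drop_N' suffix
        renamer[column_name].append(column_name + '_to_drop_' + str(len(renamer[column_name])))

# rename columns left to right, consuming one new name per occurrence
data = data.rename(
    columns=lambda column_name: renamer[column_name].pop(0)
    if column_name in renamer
    else column_name
)

# drop every column that received the '_to_drop_' suffix
data = data.drop([col for col in data.columns if '_to_drop_' in col], axis=1)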

This works, but is there a simpler way to do it?
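A boolean mask over the column index might be such a simpler way. This is only a minimal sketch, assuming a small made-up frame (the column names a and b and the variable deduped are hypothetical), and it relies on the int column coming before the float one:

```python
import pandas as pd

# Hypothetical frame: "a" appears twice (int first, float second)
data = pd.DataFrame([[1, 1.0, 5], [2, 2.0, 6]], columns=["a", "a", "b"])

# duplicated() marks later occurrences of a name, so its negation
# selects only the first column carrying each name
deduped = data.loc[:, ~data.columns.duplicated()]

print(deduped.columns.tolist())
print(deduped.dtypes)
```

Since the mask is positional, it should keep whichever copy comes first from left to right, so the column order after aggregation decides whether the int or the float version survives.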

Manon