I aggregate data from many sources and have ended up with a large number of duplicate columns that are virtually identical, except that one is of type int while the other is of type float.
Is there a way to automatically keep only the first instance of each column that shares its name with another?
I was pointed to this other question, and I can use
data.columns[data.columns.duplicated(keep=False)].tolist()
to get the names of the duplicated columns.
Adapting that answer to my problem gives this:
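# Map each duplicated name to the list of new names for its occurrences
renamer = {}
for column_name in data.columns[data.columns.duplicated(keep=False)].tolist():
    if column_name not in renamer:
        # First occurrence keeps its original name
        renamer[column_name] = [column_name]
    else:
        # Later occurrences get a '_to_drop_N' suffix
        renamer[column_name].append(column_name + '_to_drop_' + str(len(renamer[column_name])))
# rename() calls the function once per column, in order, so pop(0) hands out the names in sequence
data = data.rename(
    columns=lambda column_name: renamer[column_name].pop(0)
    if column_name in renamer
    else column_name
)
# Drop every column that was marked as a duplicate
data = data.drop([col for col in data.columns if '_to_drop_' in col], axis=1)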
This works, but I'm wondering whether there is a simpler way to do it.
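For reference, one shorter candidate I am aware of is a boolean mask built from the same duplicated() call, left at its default keep='first' instead of keep=False. This is only a minimal sketch: the sample frame below is made up for illustration, and it assumes that keeping the first occurrence of each column name (whatever its dtype) is what is wanted.

```python
import pandas as pd

# Hypothetical sample frame: "a" appears twice, once as int and once as float.
data = pd.DataFrame(
    [[1, 1.0, "x"], [2, 2.0, "y"]],
    columns=["a", "a", "b"],
)

# With the default keep="first", duplicated() flags only the later repeats,
# so negating it selects the first occurrence of every column name.
mask = ~data.columns.duplicated()
deduped = data.loc[:, mask]

print(deduped.columns.tolist())  # ['a', 'b']
```

Unlike the rename-and-drop loop, this does not need temporary column names, but it gives no control over which of the duplicates survives beyond their order in the frame.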