I have a dataframe where some column labels occur multiple times (i.e., some columns have the same label). This is causing me problems -- I may post more about this separately, because some of the behavior seems a little strange, but here I just wanted to ask about deleting some of these columns. That is, for each column label that occurs multiple times, I would like to delete all but the first column it heads. Here's an exammple:
In [5]: arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
In [6]: df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])
In [7]: df
Out[7]:
A C E A
0 0 1 2 3
1 4 5 6 7
If I drop columns using the label, all columns headed by that label are dropped:
In [9]: df.drop('A', axis=1)
Out[9]:
C E
0 1 2
1 5 6
So I thought I'd try dropping by the column index, but that also deletes all the columns headed by that label:
In [12]: df.drop(df.columns[3], axis=1)
Out[12]:
C E
0 1 2
1 5 6
How can I do what I want, that is, for each such label, delete all but one of the columns? For the above example, I'd want to end up with:
A C E
0 0 1 2
1 4 5 6
For now I've relabeled the columns, as follows:
columns = {}
new_columns = []
duplicate_num = 0
for n in df.columns:
if n in columns:
new_columns.append("duplicate%d" % (duplicate_num))
duplicate_num += 1
else:
columns[n] = 1
new_columns.append(n)
df.columns = new_columns
This works fine for my needs, but it doesn't seem like the best/cleanest solution. Thanks.
Edit: I don't see how this is a duplicate of the other question. For one thing, that deals with duplicate columns, not duplicate column labels. For another, the suggested solution there involved transposing the dataframe (twice), but as mentioned there, transposing large dataframes is inefficient, and in fact I am dealing with large dataframes.