3

I have a dataframe where some column labels occur multiple times (i.e., some columns have the same label). This is causing me problems -- I may post more about this separately, because some of the behavior seems a little strange, but here I just wanted to ask about deleting some of these columns. That is, for each column label that occurs multiple times, I would like to delete all but the first column it heads. Here's an exammple:

In [5]: arr = np.array([[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])

In [6]: df = pd.DataFrame(data=arr, columns=['A', 'C', 'E', 'A'])

In [7]: df
Out[7]:
   A  C  E  A
0  0  1  2  3
1  4  5  6  7

If I drop columns using the label, all columns headed by that label are dropped:

In [9]: df.drop('A', axis=1)
Out[9]:
   C  E
0  1  2
1  5  6

So I thought I'd try dropping by the column index, but that also deletes all the columns headed by that label:

In [12]: df.drop(df.columns[3], axis=1)
Out[12]:
   C  E
0  1  2
1  5  6

How can I do what I want, that is, for each such label, delete all but one of the columns? For the above example, I'd want to end up with:

   A  C  E
0  0  1  2
1  4  5  6

For now I've relabeled the columns, as follows:

columns = {}
new_columns = []
duplicate_num = 0
for n in df.columns:
    if n in columns:
        new_columns.append("duplicate%d" % (duplicate_num))
        duplicate_num += 1
    else:
        columns[n] = 1
        new_columns.append(n)
df.columns = new_columns

This works fine for my needs, but it doesn't seem like the best/cleanest solution. Thanks.

Edit: I don't see how this is a duplicate of the other question. For one thing, that deals with duplicate columns, not duplicate column labels. For another, the suggested solution there involved transposing the dataframe (twice), but as mentioned there, transposing large dataframes is inefficient, and in fact I am dealing with large dataframes.

mepstein
  • 133
  • 2
  • 7
  • Possible duplicate of [python pandas remove duplicate columns](http://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns) – Fabio Lamanna Nov 07 '15 at 18:07
  • I need the column labels to be unique. For now, I'd like to drop all but the first column in the group (I stated this in the OP). (That may change, in which case I'll have to adapt the solution, but I'm not going to worry about that for now.) – mepstein Nov 07 '15 at 19:36

1 Answers1

5
In [18]:  
df.ix[: , ~df.columns.duplicated()]
Out[18]:
    A   C   E
0   0   1   2
1   4   5   6

Explanation


In [19]:
~df.columns.duplicated()
Out[19]:
array([ True,  True,  True, False], dtype=bool)

as you can see here you need first to check whether a column name is duplicated or not , notice that I've added ~ at the beginning of the function . then you can slice columns using the non duplicated values

Nader Hisham
  • 5,214
  • 4
  • 19
  • 35
  • This does do what I specified (drop all but the first column in each group). And it's quite concise. Thanks, I may end up going with it. Ideally I was hoping for some slight variation on the drop() method, because that seems so close to what I want. It seems there should be some way to drop a column by specifying its (numeric) index, and not have all the like-labelled columns be dropped. – mepstein Nov 07 '15 at 20:15
  • actually the drop function doesn't take index as a parameter , I tried to pass `int` to `drop` but it did nothing , I'm not sure why do you prefer to drop column by it's index – Nader Hisham Nov 08 '15 at 01:59
  • It's not that I prefer to use the index, I just want to be able to say "drop the third column" rather than "drop all the columns with the label on the third column". – mepstein Nov 08 '15 at 15:08