11

By grouping on two columns I made some changes, and then generated a file using Python; it resulted in two duplicate columns. How do I remove the duplicate columns from a DataFrame?

Andy Hayden
Neer

6 Answers

23

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4

If they have different names you can drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4

Usually read_csv will ensure they have different names...
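
For example, a quick sketch of that default renaming (the CSV contents here are made up, not from the question):

import io
import pandas as pd

# read_csv de-duplicates repeated headers by default,
# so the second "B" comes back as "B.1"
csv_data = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"
df = pd.read_csv(io.StringIO(csv_data))
print(df.columns.tolist())  # ['A', 'B', 'B.1']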

Andy Hayden
  • FYI @Andy, there is a new option in 0.11.1 that controls this, `mangle_dupe_cols`; the default is to mangle (i.e. produce unique cols). In 0.12 this will change to leave dups in place – Jeff Jun 05 '13 at 12:19
4

Transposing is a bad idea when working with large DataFrames. See this answer for a memory-efficient alternative: https://stackoverflow.com/a/32961145/759442
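
Rather than paste the linked code here, this is a rough sketch of the same idea (comparing columns directly instead of building a transposed copy); the duplicate_columns helper and the toy data are my own, not taken from the linked answer:

import pandas as pd

def duplicate_columns(frame):
    # Compare columns by position so duplicated *names* don't interfere
    dup_positions = []
    for i in range(frame.shape[1]):
        for j in range(i + 1, frame.shape[1]):
            if frame.iloc[:, i].equals(frame.iloc[:, j]):
                dup_positions.append(j)
    return sorted(set(dup_positions))

df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4]})
dupes = duplicate_columns(df)                                  # [2]
df = df.iloc[:, [k for k in range(df.shape[1]) if k not in dupes]]
print(df.columns.tolist())                                     # ['A', 'B']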

kalu
  • Just a note for others that the best answer is not the accepted one in that thread. Best answer -> https://stackoverflow.com/a/40435354/2507197 – Alter Jun 22 '17 at 03:21
3

This is the best I've found so far.

import numpy as np

# Collect the names of any column whose values duplicate an earlier column
remove = []
cols = df.columns
for i in range(len(cols) - 1):
    v = df[cols[i]].values
    for j in range(i + 1, len(cols)):
        if np.array_equal(v, df[cols[j]].values):
            remove.append(cols[j])

df.drop(remove, axis=1, inplace=True)

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code

3

This is already answered here: python pandas remove duplicate columns. The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column has been seen before. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Let's take its inverse and call it column_selector.

Using that vector with the loc method of df, which selects rows and columns, we can remove the duplicate columns: df.loc[:, column_selector] selects only the non-duplicated columns.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
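
A tiny, self-contained demo of the same idea (the column names here are just examples):

import pandas as pd

df = pd.DataFrame([["a", 4, 4]], columns=["Col1", "Col2", "Col1"])
print(df.columns.duplicated())             # [False False  True]
df = df.loc[:, ~df.columns.duplicated()]
print(df.columns.tolist())                 # ['Col1', 'Col2']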
yugandhar
  • This is the best answer as it actually drops _only_ the duplicate columns. Most of the other answers I've seen will drop the original _and_ the duplicates. – davidavr Nov 09 '20 at 22:20
  • `.columns` is not a callable. – Kots Jun 07 '22 at 08:59
  • Can you check the type of the object you are calling this on? This works only for the pandas DataFrame type. You can use `type()` to check – yugandhar Jun 08 '22 at 10:57
0

I understand that this is an old question, but I recently ran into the same issue and none of these solutions worked for me, and the looping suggestion seemed a bit overkill. In the end, I simply found the index of the unwanted duplicate column and dropped it. Provided you know the column's index (which you can find via debugging or print statements), this works:

df.drop(df.columns[i], axis=1)
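
One caveat: df.drop works by label, so if the duplicate shares its name with the column you want to keep, both columns will be dropped. A purely positional selection is a possible workaround (a sketch; the data and the position i are made up here):

import pandas as pd

df = pd.DataFrame([["a", 4, 4]], columns=["A", "B", "B"])
i = 2                                       # position of the unwanted duplicate
df = df.iloc[:, [j for j in range(df.shape[1]) if j != i]]
print(df.columns.tolist())                  # ['A', 'B']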
Dan Carter
0

A fast solution for a dataset without NaNs:

# Detect duplicate columns on a random sample of rows,
# then keep only the surviving columns in the full DataFrame
share = 0.05
dfx = df.sample(int(df.shape[0] * share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]
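
For context, a small end-to-end run of the same idea (toy data and column names of my own). One trade-off of this design: because duplicates are detected on a small row sample, two columns that merely agree on the sampled rows could be dropped by mistake.

import pandas as pd

df = pd.DataFrame({"A": list("abc") * 100, "B": [4] * 300})
df["B_copy"] = df["B"]                      # introduce a duplicate column

share = 0.05
dfx = df.sample(int(df.shape[0] * share))   # duplicate check on a 5% row sample
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]

print(df.columns.tolist())                  # ['A', 'B'] (B_copy was dropped)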