How to remove duplicate columns from a dataframe using python pandas

Question

By grouping two columns I made some changes.

I generated a file using Python, and it ended up with two duplicate columns. How can I remove duplicate columns from a DataFrame?

Do they have the same column name? – waitingkuo Jun 05 '13 at 11:38

score 23 · Answer 1 · Andy Hayden · 2013-06-05T12:11:46.363

It's probably easiest to use a groupby (assuming they have duplicate names too):

In [11]: df
Out[11]:
   A  B  B
0  a  4  4
1  b  4  4
2  c  4  4

In [12]: df.T.groupby(level=0).first().T
Out[12]:
   A  B
0  a  4
1  b  4
2  c  4
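A self-contained sketch of the groupby approach above, assuming a small hand-built frame with a repeated column name. One caveat: first() takes the first non-null value in each group, so same-named columns that hold different values are merged silently rather than rejected.

```python
import pandas as pd

# Hypothetical frame in which the column name "B" appears twice
df = pd.DataFrame(
    [["a", 4, 4], ["b", 4, 4], ["c", 4, 4]],
    columns=["A", "B", "B"],
)

# Transpose so the column names become the index, keep the first
# row per name, then transpose back to the original orientation.
deduped = df.T.groupby(level=0).first().T

print(list(deduped.columns))
```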

If they have different names you can drop_duplicates on the transpose:

In [21]: df
Out[21]:
   A  B  C
0  a  4  4
1  b  4  4
2  c  4  4

In [22]: df.T.drop_duplicates().T
Out[22]:
   A  B
0  a  4
1  b  4
2  c  4
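A runnable sketch of the transpose approach, assuming a hand-built frame where columns B and C hold identical values. Transposing a frame with mixed dtypes turns everything into object dtype, so the sketch calls infer_objects() afterwards to recover numeric dtypes where possible.

```python
import pandas as pd

# Hypothetical frame: "B" and "C" have different names but equal values
df = pd.DataFrame({"A": ["a", "b", "c"], "B": [4, 4, 4], "C": [4, 4, 4]})

# Rows of the transpose are the original columns, so drop_duplicates()
# keeps the first of each set of equal-valued columns.
deduped = df.T.drop_duplicates().T

# The round trip through the transpose leaves object dtype behind;
# infer_objects() converts columns back to a more specific dtype.
deduped = deduped.infer_objects()

print(deduped.dtypes)
```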

read_csv will usually ensure they have different names...
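As an illustration of that renaming, here is a small sketch with made-up inline CSV text. With current pandas defaults, a repeated header name gets a numeric suffix, so a name-based check will not see the two columns as duplicates and a value-based check is needed instead.

```python
import io

import pandas as pd

# Made-up CSV text whose header repeats the name "B"
csv_text = "A,B,B\na,4,4\nb,4,4\nc,4,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# The second "B" is renamed on read
print(list(df.columns))
```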

edited Jun 05 '13 at 12:11

answered Jun 05 '13 at 12:05

FYI @Andy, there is a new option in 0.11.1 that controls this ``mangle_dup_cols``; default is TO mangle (e.g. produce unique cols), in 0.12, this will change to leave dups in place – Jeff Jun 05 '13 at 12:19

score 4 · Answer 2 · edited May 23 '17 at 12:18

Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442

answered Oct 06 '15 at 03:24

kalu

Just a note for others that the best answer is not the accepted one in that thread. Best answer -> https://stackoverflow.com/a/40435354/2507197 – Alter Jun 22 '17 at 03:21

score 3 · Answer 3 · answered Apr 10 '16 at 12:06

This is the best I found so far.

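import numpy as np

remove = []  # names of columns that repeat an earlier column's values
cols = df.columns
for i in range(len(cols)-1):
    v = df[cols[i]].values
    for j in range(i+1,len(cols)):
        # Flag column j when its values match column i element for element
        if np.array_equal(v,df[cols[j]].values):
            remove.append(cols[j])

# Drop every flagged column from df in place
df.drop(remove, axis=1, inplace=True)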

https://www.kaggle.com/kobakhit/santander-customer-satisfaction/0-84-score-with-36-features-only/code
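A possible variation on the loop above, wrapped in a function. This is a sketch, not the Kaggle author's code: the function name is made up, it selects columns by position so repeated names do not matter, and it uses Series.equals, which treats NaNs in the same position as equal but requires matching dtypes (an int column and a float column with the same numbers are not considered duplicates).

```python
import numpy as np
import pandas as pd


def drop_duplicate_value_columns(df):
    """Return df without columns whose values repeat an earlier column."""
    keep = []  # positions of the columns to keep
    for j in range(df.shape[1]):
        col = df.iloc[:, j]
        # Keep column j only if it differs from every column kept so far
        if not any(col.equals(df.iloc[:, i]) for i in keep):
            keep.append(j)
    return df.iloc[:, keep]


# Hypothetical example data, including a duplicated column with NaNs
df = pd.DataFrame({
    "A": ["a", "b", "c"],
    "B": [4, 4, 4],
    "C": [4, 4, 4],
    "D": [np.nan, 1.0, 2.0],
    "E": [np.nan, 1.0, 2.0],
})
result = drop_duplicate_value_columns(df)
print(list(result.columns))
```

Unlike the in-place version, this returns a new frame and leaves df untouched.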

score 3 · Answer 4 · answered Dec 13 '19 at 09:16

This is already answered in "python pandas remove duplicate columns". The idea is that df.columns.duplicated() generates a boolean vector in which each value says whether that column name has already been seen. For example, if df has columns ["Col1", "Col2", "Col1"], it generates [False, False, True]. Invert that vector and call the result column_selector.

The loc method of df selects rows and columns, so passing this vector as df.loc[:, column_selector] keeps the first occurrence of each column name and drops the repeats.

column_selector = ~df.columns.duplicated()
df = df.loc[:, column_selector]
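A small runnable sketch of this, using the column names from the example above and made-up values. It shows that only the later occurrence of a repeated name is dropped and the first one is kept.

```python
import pandas as pd

# Hypothetical frame with the name "Col1" used twice
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["Col1", "Col2", "Col1"])

# True for every column whose name already appeared further left
mask = df.columns.duplicated()

# Invert the mask to select first occurrences only
column_selector = ~mask
result = df.loc[:, column_selector]

print(mask.tolist())
print(list(result.columns))
```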

yugandhar

This is the best answer as it actually drops _only_ the duplicate columns. Most of the other answers I've seen will drop the original _and_ the duplicates. – davidavr Nov 09 '20 at 22:20
`.columns` is not a callable. – Kots Jun 07 '22 at 08:59
Can you check the type you are calling this on? It works only for a pandas DataFrame. You can use `type()` to check the type – yugandhar Jun 08 '22 at 10:57

score 0 · Answer 5 · answered Jun 21 '17 at 17:17

I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):

df.drop(df.columns[i], axis=1)
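One thing to watch for: df.columns[i] yields a label, so if the duplicate columns share a name, dropping by that label removes every column with that name. The sketch below, on a made-up frame, contrasts that with a purely positional drop built from a boolean mask and iloc.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where the unwanted duplicate sits at position 2
df = pd.DataFrame([["a", 4, 4], ["b", 4, 4]], columns=["A", "B", "B"])
i = 2

# Label-based drop: removes both columns named "B"
by_label = df.drop(df.columns[i], axis=1)

# Position-based drop: mask out position i only
keep = np.ones(df.shape[1], dtype=bool)
keep[i] = False
by_position = df.iloc[:, keep]

print(list(by_label.columns))
print(list(by_position.columns))
```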

score 0 · Answer 6 · answered Apr 30 '22 at 18:23

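A fast solution for datasets without NaNs: detect duplicate columns on a small random sample of rows. Columns that happen to match on the sample but differ elsewhere are dropped too, so the result is approximate: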

share = 0.05
dfx = df.sample(int(df.shape[0]*share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]
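Because two columns can agree on the sampled rows and still differ elsewhere, one possible refinement is to treat the sample result as a shortlist and confirm each dropped column against the full data. The sketch below is an assumption-laden illustration, not part of the original answer: the function name and the seed parameter are made up, and it assumes unique column names.

```python
import pandas as pd


def drop_duplicate_columns_sampled(df, share=0.05, seed=0):
    """Find duplicate columns on a row sample, then verify on the full frame."""
    n = max(1, int(df.shape[0] * share))
    sample = df.sample(n, random_state=seed)

    # Columns that are distinct on the sample are certainly distinct overall
    kept = list(sample.T.drop_duplicates().index)

    # Columns dropped on the sample are only candidates: keep any that
    # do not match a kept column on the full data.
    for name in df.columns:
        if name in kept:
            continue
        if not any(df[name].equals(df[k]) for k in kept):
            kept.append(name)

    # Preserve the original column order
    return df[[c for c in df.columns if c in kept]]


# Hypothetical data: "B" copies "A", while "C" differs only in the last row
df = pd.DataFrame({"A": range(100)})
df["B"] = df["A"]
df["C"] = df["A"]
df.loc[99, "C"] = -1

result = drop_duplicate_columns_sampled(df)
print(list(result.columns))
```

The verification pass costs full-column comparisons for the shortlisted columns, so the saving over a full transpose depends on how many candidates the sample produces.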

Alexandr Kosolapov
