7

I have a DataFrame whose columns contain duplicate data under different names:

In[1]: df
Out[1]: 
  X1   X2  Y1   Y2
 0.0  0.0  6.0  6.0
 3.0  3.0  7.1  7.1
 7.6  7.6  1.2  1.2

I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know and I can just use .drop().

Ch3steR
ahnnni

2 Answers

7

We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns.

df.drop_duplicates only removes duplicate rows.

Return DataFrame with duplicate rows removed.
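For example, on the question's frame it is a no-op, since no two rows are identical:

df.drop_duplicates()
    X1   X2   Y1   Y2
0  0.0  0.0  6.0  6.0
1  3.0  3.0  7.1  7.1
2  7.6  7.6  1.2  1.2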

We can create a function around np.unique to drop duplicate columns.

import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2

Online Demo

NB: np.unique docs:

Returns the sorted unique elements of an array.

Workaround: To retain the original column order, sort idxs before using them, as sketched below.
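A minimal sketch of that workaround (assuming the numeric frame from the question, with np imported as above; the helper name is just for illustration): sort the first-occurrence indices and index the original frame, so the kept columns stay in their original left-to-right order.

def drop_duplicate_cols_keep_order(df):
    # np.unique sorts the unique columns, so sort the returned
    # first-occurrence indices to restore the original column order
    _, idxs = np.unique(df, return_index=True, axis=1)
    return df.iloc[:, np.sort(idxs)]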


Using .T on a DataFrame with mixed dtypes is going to mess with your actual dtypes.

df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A      int64
B     object
C      int64
D    float64
dtype: object

df.T.T.dtypes
A    object
B    object
C    object
D    object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A      int64
B     object
C      int64
D    float64
dtype: object
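If you do go the .T route on a mixed-dtype frame, here is a small sketch (not from the original answer) combining it with the .astype trick above to restore the surviving columns' dtypes, using the same df as in this snippet:

out = df.T.drop_duplicates().T            # duplicate column C is dropped, but everything is now object
out = out.astype(df.dtypes[out.columns])  # restore each remaining column's original dtype
out.dtypes
A      int64
B     object
D    float64
dtype: object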
Ch3steR
  • This should be faster as it uses numpy functions over a numpy array without having to transform the shape. +1 – anky Sep 25 '21 at 05:37
  • @anky Thank you :D I believe `.T` has a certain overhead. With timeit over the OP's data, `np.unique` is almost 8X faster. But `np.unique` doesn't scale well, I guess; I read that somewhere. – Ch3steR Sep 25 '21 at 05:44
  • @anky One more major flaw: when a df has multiple dtypes, `.T` would change the dtypes too. – Ch3steR Sep 25 '21 at 10:25
  • Not a flaw but a caveat to know. – Ch3steR Sep 25 '21 at 13:50
  • Apologies, I forgot to mention this, but my data has another column with no duplicates, e.g. X1 X2 Y1 Y2 Z. I noticed this code dropped the Z column completely. How do I handle this column as an exception? – ahnnni Sep 26 '21 at 07:06
  • @ahnnni Post the data with the `Z` column, please. The solution should work, but please post the data so I can inspect what's going on. – Ch3steR Sep 26 '21 at 07:20
  • @ahnnni [Online demo](https://replit.com/@Ch3seR/tobool#main.py) working as expected. – Ch3steR Sep 26 '21 at 07:35
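For the record, a quick sketch of the Z case discussed above, with assumed data since the actual frame was never posted; a column that has no duplicate partner is kept by drop_duplicate_cols:

df2 = pd.DataFrame({'X1': [0.0, 3.0, 7.6], 'X2': [0.0, 3.0, 7.6],
                    'Y1': [6.0, 7.1, 1.2], 'Y2': [6.0, 7.1, 1.2],
                    'Z': [9.0, 9.5, 9.9]})  # Z values are made up for the demo
drop_duplicate_cols(df2)
    X1   Y1    Z
0  0.0  6.0  9.0
1  3.0  7.1  9.5
2  7.6  1.2  9.9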
6

You could transpose with .T and drop_duplicates, then transpose back:

>>> df.T.drop_duplicates().T
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 

Or with loc and the negation of duplicated, which keeps the first occurrence of each column:

>>> df.loc[:, ~df.T.duplicated()]
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 
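For clarity, this is what the mask looks like, assuming the frame from the question; negating it keeps only the first occurrence of each column:

>>> df.T.duplicated()
X1    False
X2     True
Y1    False
Y2     True
dtype: bool
>>> 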
U13-Forward