7

I have a DataFrame whose columns contain duplicate data under different names:

In[1]: df
Out[1]: 
  X1   X2  Y1   Y2
 0.0  0.0  6.0  6.0
 3.0  3.0  7.1  7.1
 7.6  7.6  1.2  1.2

I know .drop(columns=...) exists, but is there a more efficient way to drop these without having to list the column names? If not, please let me know and I can just use .drop().

Ch3steR
ahnnni

2 Answers

7

We can use np.unique over axis 1. Unfortunately, there's no pandas built-in function to drop duplicate columns.

df.drop_duplicates only removes duplicate rows.

Return DataFrame with duplicate rows removed.
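For example, on the question's frame it is a no-op, since no two rows are identical:

df.drop_duplicates()
    X1   X2   Y1   Y2
0  0.0  0.0  6.0  6.0
1  3.0  3.0  7.1  7.1
2  7.6  7.6  1.2  1.2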

We can create a function around np.unique to drop duplicate columns.

import numpy as np
import pandas as pd

def drop_duplicate_cols(df):
    uniq, idxs = np.unique(df, return_index=True, axis=1)
    return pd.DataFrame(uniq, index=df.index, columns=df.columns[idxs])

drop_duplicate_cols(df)
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2

Online Demo

NB: np.unique docs:

Returns the sorted unique elements of an array.

Workaround: To retain the original column order, sort idxs before using them, as sketched below.
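A minimal sketch of that workaround (assuming the numeric frame from the question, with np imported as above; the helper name is just for illustration): sort the first-occurrence indices and index the original frame, so the kept columns stay in their original left-to-right order.

def drop_duplicate_cols_keep_order(df):
    # np.unique sorts the unique columns, so sort the returned
    # first-occurrence indices to restore the original column order
    _, idxs = np.unique(df, return_index=True, axis=1)
    return df.iloc[:, np.sort(idxs)]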


Using .T on a DataFrame with mixed dtypes is going to mess with your actual dtypes.

df = pd.DataFrame({'A': [0, 1], 'B': ['a', 'b'], 'C': [0, 1], 'D':[2.1, 3.1]})
df.dtypes
A      int64
B     object
C      int64
D    float64
dtype: object

df.T.T.dtypes
A    object
B    object
C    object
D    object
dtype: object
# To get back original `dtypes` we can use `.astype`
df.T.T.astype(df.dtypes).dtypes
A      int64
B     object
C      int64
D    float64
dtype: object
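If you do go the .T route on a mixed-dtype frame, here is a small sketch (not from the original answer) combining it with the .astype trick above to restore the surviving columns' dtypes, using the same df as in this snippet:

out = df.T.drop_duplicates().T            # duplicate column C is dropped, but everything is now object
out = out.astype(df.dtypes[out.columns])  # restore each remaining column's original dtype
out.dtypes
A      int64
B     object
D    float64
dtype: object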
Ch3steR
  • This should be faster as it uses numpy functions over a numpy array without having to transform the shape. +1 – anky Sep 25 '21 at 05:37
  • @anky Thank you :D I believe `.T` has a certain overhead. With timeit over the OP's data, `np.unique` is almost 8X faster. But `np.unique` doesn't scale well, I guess; I read that somewhere. – Ch3steR Sep 25 '21 at 05:44
  • @anky One more major flaw: when a df has multiple dtypes, `.T` would change the dtypes too. – Ch3steR Sep 25 '21 at 10:25
  • Not a flaw but a caveat to know. – Ch3steR Sep 25 '21 at 13:50
  • Apologies, I forgot to mention this, but my data has another column with no duplicates, e.g. X1 X2 Y1 Y2 Z. I noticed this code dropped the Z column completely. How do I handle this column as an exception? – ahnnni Sep 26 '21 at 07:06
  • @ahnnni Post the data with the `Z` column, please. The solution should work, but please post the data so I can inspect what's going on. – Ch3steR Sep 26 '21 at 07:20
  • @ahnnni [Online demo](https://replit.com/@Ch3seR/tobool#main.py) working as expected. – Ch3steR Sep 26 '21 at 07:35
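For the record, a quick sketch of the Z case discussed above, with assumed data since the actual frame was never posted; a column that has no duplicate partner is kept by drop_duplicate_cols:

df2 = pd.DataFrame({'X1': [0.0, 3.0, 7.6], 'X2': [0.0, 3.0, 7.6],
                    'Y1': [6.0, 7.1, 1.2], 'Y2': [6.0, 7.1, 1.2],
                    'Z': [9.0, 9.5, 9.9]})  # Z values are made up for the demo
drop_duplicate_cols(df2)
    X1   Y1    Z
0  0.0  6.0  9.0
1  3.0  7.1  9.5
2  7.6  1.2  9.9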
6

You could transpose with .T and drop_duplicates, then transpose back:

>>> df.T.drop_duplicates().T
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 

Or with loc and the negation of duplicated, which keeps the first occurrence of each column:

>>> df.loc[:, ~df.T.duplicated()]
    X1   Y1
0  0.0  6.0
1  3.0  7.1
2  7.6  1.2
>>> 
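For clarity, this is what the mask looks like, assuming the frame from the question; negating it keeps only the first occurrence of each column:

>>> df.T.duplicated()
X1    False
X2     True
Y1    False
Y2     True
dtype: bool
>>> 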
U13-Forward