
So, by using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner')

I get a DataFrame that looks like this:

    A    A    B    B
0   5    5   10   10
1   6    6   19   19

and I want to remove the duplicate columns so that I end up with:

    A     B
0   5    10
1   6    19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name contain identical values in every row. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T

but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer one that doesn't need explicit knowledge of the column names.
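
For context, here is a minimal reproduction of the setup described above; the values are made up to mirror the example frames:

    import pandas as pd

    # two subsets of the same data that share the columns 'A' and 'B'
    df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
    df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

    df_ab = pd.concat([df_a, df_b], axis=1, join='inner')
    print(df_ab)  # every shared column name now appears twice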

Peter Klauke

4 Answers


The easiest way is:

df = df.loc[:,~df.columns.duplicated()]

One line of code can change everything
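
A self-contained sketch of that one-liner on the frame from the question; the trailing .copy() is my own addition (not part of the answer), just to avoid a SettingWithCopyWarning if you modify the result afterwards:

    import pandas as pd

    df_ab = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]],
                         columns=['A', 'A', 'B', 'B'])

    # ~columns.duplicated() is True for the first occurrence of each label
    # and False for every repeat, so .loc keeps exactly one column per name
    df_ab = df_ab.loc[:, ~df_ab.columns.duplicated()].copy()
    print(df_ab)
    #    A   B
    # 0  5  10
    # 1  6  19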

Prayson W. Daniel
  • This should be the verified answer, as not ALL columns have to be exactly the same all the time – Jeru Luke Jul 06 '17 at 18:46
  • This fails for a large number of columns. I get this error `MemoryError: Unable to allocate 480. GiB for an array with shape (87494, 736334) and data type object`. This is the shape of my dataframe: `(736334, 1312)`. – Murtaza Haji May 07 '20 at 03:19
  • If I were you I would not read all the data at once. Read it in chunks, e.g. columns/N, and operate on the smaller chunks, or randomly read a handful of columns at a time, say a (736334, 5) slice, and remove the duplicate columns. Then get the remaining columns as a list and read your data keeping only those columns. Look at pandas-like libraries such as Modin, Dask, Ray and Blaze that support large data, and check out https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html – Prayson W. Daniel May 07 '20 at 04:07
  • Plus if you have GPUs see cudf. – Prayson W. Daniel May 07 '20 at 04:19

Perhaps you would be better off avoiding the problem altogether, by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner')

This will merge df_a and df_b on all of the columns they have in common.
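
A small sketch of how that looks end to end (the example data is mine, not from the answer). One caveat worth knowing: merge joins on the values of the shared columns rather than on the index, so if the same combination of values occurs more than once you will get every matching pairing back:

    import pandas as pd

    df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
    df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

    # with no 'on=' argument, merge uses every column name the frames share
    df_ab = pd.merge(df_a, df_b, how='inner')
    print(df_ab)
    #    A   B
    # 0  5  10
    # 1  6  19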

unutbu

You may use np.unique to get indices of unique columns, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
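
The same snippet as a runnable script; note (my observation, not part of the answer) that np.unique returns the labels in sorted order, so the surviving columns come back alphabetically sorted rather than in their original order:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]],
                      columns=['A', 'A', 'B', 'B'])

    # indices of the first occurrence of each unique column label
    _, i = np.unique(df.columns, return_index=True)
    print(df.iloc[:, i])
    #    A   B
    # 0  5  10
    # 1  6  19
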
behzad.nouri
  • I don't know how pandas is measuring the speed, but they claim the built-in unique method is much faster: Index.unique() http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.Index.unique.html – bretcj7 Dec 22 '17 at 02:53

For those who skip the question and look straight at the answers: the simplest way for me is to use the OP's own solution (assuming you don't run into the same performance issues he did): transpose the DataFrame, use drop_duplicates, and then transpose it again:

df.T.drop_duplicates().T
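
One situation where this variant still earns its keep (as a comment below points out) is when two columns hold identical values under different names; label-based approaches keep both, while the transpose trick drops the second one. A small sketch with made-up data:

    import pandas as pd

    # 'B' and 'B_copy' contain the same values under different names
    df = pd.DataFrame({'A': [5, 6], 'B': [10, 19], 'B_copy': [10, 19]})

    # compares whole columns by value; keeps the first of each duplicate group
    print(df.T.drop_duplicates().T)
    #    A   B
    # 0  5  10
    # 1  6  19

Keep in mind that transposing copies the data (and upcasts mixed dtypes to object), which is why this approach gets slow on large frames.
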
James Wright
  • This worked for me, but it was very slow. The answer from @Prayson W. Daniel ran in a fraction of the time. – Declan Sep 27 '19 at 14:32
  • That answer only works if the column names are identical. If you have identical column values and different names, you'd want the transpose solution. – GrimSqueaker Sep 14 '20 at 08:52
  • This may be slower because it creates a new object instead of operating on a view. With the solution of @Prayson W. Daniel, I kept getting the `SettingWithCopyWarning`. – Teepeemm Feb 03 '21 at 23:58