
So, by using

df_ab = pd.concat([df_a, df_b], axis=1, join='inner')

I get a DataFrame that looks like this:

    A    A    B    B
0   5    5   10   10
1   6    6   19   19

and I want to remove the duplicate columns so that I end up with:

    A     B
0   5    10
1   6    19

Because df_a and df_b are subsets of the same DataFrame, I know that columns with the same name contain identical values in every row. I have a working solution:

df_ab = df_ab.T.drop_duplicates().T

but I have a large number of rows, so this is very slow. Does anyone have a faster solution? I would prefer one that doesn't need explicit knowledge of the column names.
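
For context, here is a minimal reproduction of the setup described above; the values are made up to mirror the example frames:

    import pandas as pd

    # two subsets of the same data that share the columns 'A' and 'B'
    df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
    df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

    df_ab = pd.concat([df_a, df_b], axis=1, join='inner')
    print(df_ab)  # every shared column name now appears twice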

Peter Klauke

4 Answers


The easiest way is:

df = df.loc[:,~df.columns.duplicated()]

One line of code can change everything
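
A self-contained sketch of that one-liner on the frame from the question; the trailing .copy() is my own addition (not part of the answer), just to avoid a SettingWithCopyWarning if you modify the result afterwards:

    import pandas as pd

    df_ab = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]],
                         columns=['A', 'A', 'B', 'B'])

    # ~columns.duplicated() is True for the first occurrence of each label
    # and False for every repeat, so .loc keeps exactly one column per name
    df_ab = df_ab.loc[:, ~df_ab.columns.duplicated()].copy()
    print(df_ab)
    #    A   B
    # 0  5  10
    # 1  6  19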

Prayson W. Daniel
  • This should be the verified answer, as not ALL columns have to be exactly the same all the time – Jeru Luke Jul 06 '17 at 18:46
  • This fails for a large number of columns. I get this error `MemoryError: Unable to allocate 480. GiB for an array with shape (87494, 736334) and data type object`. This is the shape of my dataframe: `(736334, 1312)`. – Murtaza Haji May 07 '20 at 03:19
  • If I were you I would not read all the data at once. Read it in chunks, e.g. columns/N, and operate on the smaller chunks, or randomly read a handful of columns at a time, say a (736334, 5) slice, and remove the duplicate columns. Then get the remaining columns as a list and read your data keeping only those columns. Look at pandas-like libraries such as Modin, Dask, Ray and Blaze that support large data, and check out https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html – Prayson W. Daniel May 07 '20 at 04:07
  • Plus if you have GPUs see cudf. – Prayson W. Daniel May 07 '20 at 04:19

Perhaps you would be better off avoiding the problem altogether, by using pd.merge instead of pd.concat:

df_ab = pd.merge(df_a, df_b, how='inner')

This will merge df_a and df_b on all of the columns they have in common.
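
A small sketch of how that looks end to end (the example data is mine, not from the answer). One caveat worth knowing: merge joins on the values of the shared columns rather than on the index, so if the same combination of values occurs more than once you will get every matching pairing back:

    import pandas as pd

    df_a = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})
    df_b = pd.DataFrame({'A': [5, 6], 'B': [10, 19]})

    # with no 'on=' argument, merge uses every column name the frames share
    df_ab = pd.merge(df_a, df_b, how='inner')
    print(df_ab)
    #    A   B
    # 0  5  10
    # 1  6  19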

unutbu

You may use np.unique to get indices of unique columns, and then use .iloc:

>>> df
   A  A   B   B
0  5  5  10  10
1  6  6  19  19
>>> _, i = np.unique(df.columns, return_index=True)
>>> df.iloc[:, i]
   A   B
0  5  10
1  6  19
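
The same snippet as a runnable script; note (my observation, not part of the answer) that np.unique returns the labels in sorted order, so the surviving columns come back alphabetically sorted rather than in their original order:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([[5, 5, 10, 10], [6, 6, 19, 19]],
                      columns=['A', 'A', 'B', 'B'])

    # indices of the first occurrence of each unique column label
    _, i = np.unique(df.columns, return_index=True)
    print(df.iloc[:, i])
    #    A   B
    # 0  5  10
    # 1  6  19
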
behzad.nouri
  • I don't know how pandas is measuring the speed, but they claim the built-in unique method is much faster: Index.unique() http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.Index.unique.html – bretcj7 Dec 22 '17 at 02:53

For those who skip the question and look straight at the answers: the simplest way for me is to use the OP's own solution (assuming you don't run into the same performance issues he did): transpose the DataFrame, use drop_duplicates, and then transpose it again:

df.T.drop_duplicates().T
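
One situation where this variant still earns its keep (as a comment below points out) is when two columns hold identical values under different names; label-based approaches keep both, while the transpose trick drops the second one. A small sketch with made-up data:

    import pandas as pd

    # 'B' and 'B_copy' contain the same values under different names
    df = pd.DataFrame({'A': [5, 6], 'B': [10, 19], 'B_copy': [10, 19]})

    # compares whole columns by value; keeps the first of each duplicate group
    print(df.T.drop_duplicates().T)
    #    A   B
    # 0  5  10
    # 1  6  19

Keep in mind that transposing copies the data (and upcasts mixed dtypes to object), which is why this approach gets slow on large frames.
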
James Wright
  • This worked for me, but it was very slow. The answer from @Prayson W. Daniel ran in a fraction of the time. – Declan Sep 27 '19 at 14:32
  • That answer only works if the column names are identical. If you have identical column values and different names, you'd want the transpose solution. – GrimSqueaker Sep 14 '20 at 08:52
  • This may be slower because it creates a new object instead of operating on a view. With the solution of @Prayson W. Daniel, I kept getting the `SettingWithCopyWarning`. – Teepeemm Feb 03 '21 at 23:58