
Problem

How can I remove duplicate cells from each row of a Pandas dataframe, considering each row separately (and perhaps replacing them with NaNs)?

It would be even better if we could shift all newly created NaNs to the end of each row.


Related but different posts

Posts on how to remove entire rows which are deemed duplicates:

Post on how to remove duplicates from a list which is in a Pandas column:

The answer given there returns a series of strings, not a dataframe.


Reproducible setup

import pandas as pd

Let's create a dataframe:

df = pd.DataFrame({'a': ['A', 'A', 'C', 'B'],
                   'b': ['B', 'D', 'B', 'B'],
                   'c': ['C', 'C', 'C', 'A'],
                   'd': ['D', 'D', 'B', 'A']},
                   index=[0, 1, 2, 3])

df created:

+----+-----+-----+-----+-----+
|    | a   | b   | c   | d   |
|----+-----+-----+-----+-----|
|  0 | A   | B   | C   | D   |
|  1 | A   | D   | C   | D   |
|  2 | C   | B   | C   | B   |
|  3 | B   | B   | A   | A   |
+----+-----+-----+-----+-----+



A solution

One way of dropping duplicates from each row, considering each row separately:

df = df.apply(lambda row: row.drop_duplicates(keep='first'), axis='columns')

using apply() with a lambda and Series.drop_duplicates(). (Each row is already passed in as a Series, so wrapping it in pd.Series() again is unnecessary.)
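
After this step the duplicates are replaced by NaN in place, so gaps can remain in the middle of a row (note row 3):

+----+-----+-----+-----+-----+
|    | a   | b   | c   | d   |
|----+-----+-----+-----+-----|
|  0 | A   | B   | C   | D   |
|  1 | A   | D   | C   | nan |
|  2 | C   | B   | nan | nan |
|  3 | B   | nan | A   | nan |
+----+-----+-----+-----+-----+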

Then shove all NaNs to the end of each row, using the approach from "Shift NaNs to the end of their respective rows":

df.apply(lambda x: pd.Series(x[x.notnull()].values.tolist() + x[x.isnull()].values.tolist()), axis='columns')

Output:

+----+-----+-----+-----+-----+
|    | 0   | 1   | 2   | 3   |
|----+-----+-----+-----+-----|
|  0 | A   | B   | C   | D   |
|  1 | A   | D   | C   | nan |
|  2 | C   | B   | nan | nan |
|  3 | B   | A   | nan | nan |
+----+-----+-----+-----+-----+

Just as we wished.


Question

Is there a more efficient way to do this? Perhaps with some built-in Pandas functions?

zabop
  • I tried to find a question which this post is a duplicate of, but to my surprise, I haven't found any which is a good fit for a dupe. I welcome any suggestions regarding duplicates. – zabop Aug 25 '20 at 16:49
  • You never defined what you mean by *'duplicates'*, especially since you're using a different meaning to normal ('distinct values across multiple columns, considered row-wise'), and your title is overly broad, people will wrongly be sent here by Google and SO search. Wrt your example, on row 1 why did you not remove the second 'C' from column 'c', but you did remove the second 'D' from column 'd'? This makes no sense. Also, "replace cells with NaNs" is not really "remove", so this is two-questions-in-one and the code solutions will be different (`fillna`, `duplicated`, `drop_duplicates`, etc.). – smci Jan 21 '21 at 19:01
  • In your use-case, you have **multiple columns all considered equivalent**, same dtype, don't-care about the names. So 'FirstName=Murphy, LastName=Brown' would be considered a 'duplicate' of 'Brown, Murphy'; or a zipcode of 77024 and income of 60001 or customer-id of 45678 would all be considered 'equivalent' to other permutations of the same values, across columns. This is absolutely not a standard definition of 'duplicates'. Your data is really just an array, not a genuine dataframe, and the part *"shift all newly-created NaNs to the end of each row"* proves it. – smci Jan 21 '21 at 19:23
  • I don't get the concern: "on row 1 why did you not remove the second 'C' from column 'c', but you did remove the second 'D' from column 'd'?" There is only one 'C' in row 1. – zabop Jan 21 '21 at 20:32
  • About the others: I hope your edit addresses these concerns & the post is now ok. – zabop Jan 21 '21 at 20:32
  • Although the beginning of the new question title, ie: "Removing Pandas rows with duplicate values across multiple columns" is misleading I think, since we are not actually removing the rows. – zabop Jan 21 '21 at 20:34
  • Could you find a question title to address this? – zabop Jan 21 '21 at 20:35
  • Then your example is meant to illustrate your (non-standard) definition of 'duplicate': "duplicate value within same row" (not within column): so on row 1 the 'D's are considered duplicates, but the 'C' in column 'c' is **not** (although it would be a duplicate if we scanned column 'c' vertically). And also, you don't want to remove rows; only replace duplicate values with NAs. (That's not standard use-case). Anyway I edited the title once more to state this. – smci Jan 21 '21 at 20:37
  • Yeah, I am aware usually duplicate is meant within columns. I like the new question title, thanks. – zabop Jan 21 '21 at 20:41
  • You know, if you only have a reasonably small well-defined set of distinct values across columns (do you?), I wouldn't start by using a dataframe column for each value, just a Python set, or [bit-array](https://stackoverflow.com/questions/20845686/python-bit-array-performant). Especially if the original order didn't matter. Can you give us a little context? Are these zipcodes? people's names? word-vectors? search index tags? something else? – smci Jan 21 '21 at 20:43
  • I don't remember the exact context, but these were indeed names (ie strings)... How I ended up having this exact situation, I can't recall now, but I'll comment later if I manage to remember. – zabop Jan 21 '21 at 23:05

5 Answers


You can stack and then drop_duplicates, then pivot back with the help of a cumcount column. The stack preserves the order in which the values appear along each row, and the cumcount ensures that the NaNs end up at the end.

df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()

df1['col'] = df1.groupby('level_0').cumcount()
df1 = (df1.pivot(index='level_0', columns='col', values=0)
          .rename_axis(index=None, columns=None))

   0  1    2    3
0  A  B    C    D
1  A  D    C  NaN
2  C  B  NaN  NaN
3  B  A  NaN  NaN
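
For intuition, this is what the intermediate long frame looks like just before the pivot (derived from the example df; the cumcount column col becomes the new column labels):

    level_0  0  col
0         0  A    0
1         0  B    1
2         0  C    2
3         0  D    3
4         1  A    0
5         1  D    1
6         1  C    2
8         2  C    0
9         2  B    1
12        3  B    0
14        3  A    1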

Timings

Assuming 4 columns, let's see how a bunch of these methods compare as the number of rows grows. The map and apply solutions have a good advantage when things are small, but they become a bit slower than the more involved stack + drop_duplicates + pivot solution as the DataFrame gets longer. Regardless, they all start to take a while for a large DataFrame.

import perfplot
import pandas as pd
import numpy as np

def stack(df):
    df1 = df.stack().reset_index().drop(columns='level_1').drop_duplicates()

    df1['col'] = df1.groupby('level_0').cumcount()
    df1 = (df1.pivot(index='level_0', columns='col', values=0)
              .rename_axis(index=None, columns=None))
    return df1

def apply_drop_dup(df):
    return pd.DataFrame.from_dict(df.apply(lambda x: x.drop_duplicates().tolist(),
                                           axis=1).to_dict(), orient='index')

def apply_unique(df):
    return pd.DataFrame(df.apply(pd.Series.unique, axis=1).tolist())


def list_map(df):
    return pd.DataFrame(list(map(pd.unique, df.values)))


perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                 columns=list('abcd')), 
    kernels=[
        lambda df: stack(df),
        lambda df: apply_drop_dup(df),
        lambda df: apply_unique(df),
        lambda df: list_map(df),
    ],
    labels=['stack', 'apply_drop_dup', 'apply_unique', 'list_map'],
    n_range=[2 ** k for k in range(18)],
    equality_check=lambda x,y: x.compare(y).empty,  
    xlabel='~len(df)'
)

[perfplot timing chart: stack, apply_drop_dup, apply_unique and list_map runtimes vs. DataFrame length]


Finally, if preserving the order in which the values originally appeared within each row is unimportant, you can use numpy. To de-duplicate, you sort each row and then check for differences between neighboring values. Then create an output array that shifts the surviving values to the left, pushing the NaN to the right. Because this method will always return 4 columns, we require a dropna to match the other output in the case that every row has fewer than 4 unique values.

def with_numpy(df):
    arr = np.sort(df.to_numpy(), axis=1)
    r = np.roll(arr, 1, axis=1)
    r[:, 0] = np.nan

    arr = np.where((arr != r), arr, np.nan)

    # Move all NaN to the right. Credit @Divakar
    mask = pd.notnull(arr)
    justified_mask = np.flip(np.sort(mask, axis=1), 1)
    out = np.full(arr.shape, np.nan, dtype=object)
    out[justified_mask] = arr[mask]

    return pd.DataFrame(out, index=df.index).dropna(how='all', axis='columns')

with_numpy(df)
#   0  1    2    3
#0  A  B    C    D
#1  A  C    D  NaN
#2  B  C  NaN  NaN     # B/c this method sorts, B before C
#3  A  B  NaN  NaN
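
To see the "justify" step on its own, here is a minimal sketch on a single hand-built row:

import numpy as np
import pandas as pd

arr = np.array([['A', np.nan, 'C', np.nan]], dtype=object)
mask = pd.notnull(arr)                              # [[ True False  True False]]
justified_mask = np.flip(np.sort(mask, axis=1), 1)  # [[ True  True False False]]
out = np.full(arr.shape, np.nan, dtype=object)
out[justified_mask] = arr[mask]                     # values keep their row order
print(out)                                          # [['A' 'C' nan nan]]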

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(list('ABCD'), (n, 4)),
                                 columns=list('abcd')), 
    kernels=[
        lambda df: stack(df),
        lambda df: with_numpy(df),
    ],
    labels=['stack', 'with_numpy'],
    n_range=[2 ** k for k in range(3, 22)],
    # Lazy check to deal with string/NaN and irrespective of sort order. 
    equality_check=lambda x, y: (np.sort(x.fillna('ZZ').to_numpy(), 1) 
                                 == np.sort(y.fillna('ZZ').to_numpy(), 1)).all(),
    xlabel='len(df)'
)

[perfplot timing chart: stack vs. with_numpy runtimes over DataFrame length]

ALollz
  • Thanks! This indeed works. Going to leave the question open a bit in case someone comes up with a simpler solution, if that doesn't happen, will definitely accept this one. – zabop Aug 25 '20 at 16:57
  • 1
    @ALollz: thanks for the `perfplot`. I am truly curious about the performance of each solution, but I am suck at setting it. I already upvoted yours. Otherwise, I would upvote it again on the `perfplot` :) – Andy L. Aug 25 '20 at 18:48

Try something new:

df = pd.DataFrame(list(map(pd.unique, df.values)))
Out[447]: 
   0  1     2     3
0  A  B     C     D
1  A  D     C  None
2  C  B  None  None
3  B  A  None  None
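
pd.unique preserves the order of appearance and returns a shorter array for rows that contain duplicates; pd.DataFrame then pads the shorter rows with None. A quick check on row 1 of the example df:

pd.unique(df.loc[1].values)
# array(['A', 'D', 'C'], dtype=object)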
BENY

Use apply and construct a new dataframe with pd.DataFrame.from_dict, passing orient='index':

df_final = pd.DataFrame.from_dict(df.apply(lambda x: x.drop_duplicates().tolist(),
                                               axis=1).to_dict(), orient='index')

Out[268]:
   0  1     2     3
0  A  B     C     D
1  A  D     C  None
2  C  B  None  None
3  B  A  None  None

Note: None is practically similar to NaN. If you want exact NaN, just chain an additional .fillna(np.nan).
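
A sketch of the chained version (assuming numpy has been imported as np):

df_final = pd.DataFrame.from_dict(df.apply(lambda x: x.drop_duplicates().tolist(),
                                           axis=1).to_dict(),
                                  orient='index').fillna(np.nan)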

Andy L.

You could search for duplicates along the row axis and then sort the results to push the NaN to the end of each row, using pd.isnull as the sort key:

duplicates = df.apply(pd.Series.duplicated, axis=1)
df.where(~duplicates, np.nan).apply(lambda x: pd.Series(sorted(x, key=pd.isnull)), axis=1)
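
For reference, the intermediate duplicates mask on the example df marks every value already seen earlier in the same row:

       a      b      c      d
0  False  False  False  False
1  False  False  False   True
2  False  False   True   True
3  False   True  False   True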

Output

| 0   | 1   | 2   | 3   |
|:----|:----|:----|:----|
| A   | B   | C   | D   |
| A   | D   | C   | NaN |
| C   | B   | NaN | NaN |
| B   | A   | NaN | NaN |
Sebastien D

Apply pd.Series.unique on each row, extract the result and reconstruct the dataframe:

print (pd.DataFrame(df.apply(pd.Series.unique, axis=1).tolist()))

   0  1     2     3
0  A  B     C     D
1  A  D     C  None
2  C  B  None  None
3  B  A  None  None
Henry Yik