2

I have a dataFrame with 2 columns a A and B. I have to separate out subset of dataFrames using pandas to delete all the duplicate values.

For Example

My dataFrame looks like this

**A     B**
1     1
2     3
4     4
8     8 
5     6
4     7

Then the output should be

**A     B**
1     1       <--- both values Highlighted
2     3
4     4       <--- both values Highlighted
8     8       <--- both values Highlighted 
5     6
4     7       <--- value in column A highlighted

How do I do that?

Thanks in advance.

  • 1
    Welcome to Stack Overflow! Have you tried anything? Do you have any code to show us? Do also look at https://stackoverflow.com/questions/41203959/conditionally-format-python-pandas-cell and see [How to Ask](https://stackoverflow.com/help/how-to-ask) – Jaideep Shekhar Jan 11 '20 at 12:24

1 Answers1

3

You can use this:

def color_dupes(x):
    c1='background-color:red'
    c2=''
    cond=x.stack().duplicated(keep=False).unstack()
    df1 = pd.DataFrame(np.where(cond,c1,c2),columns=x.columns,index=x.index)
    return df1
df.style.apply(color_dupes,axis=None)
# if df has many columns: df.style.apply(color_dupes,axis=None,subset=['A','B'])

Example working code:

enter image description here

Explanation: First we stack the dataframe so as to bring all the columns into a series and find duplicated with keep=False to mark all duplicates as true:

df.stack().duplicated(keep=False)

0  A     True
   B     True
1  A    False
   B    False
2  A     True
   B     True
3  A     True
   B     True
4  A    False
   B    False
5  A     True
   B    False
dtype: bool

After this we unstack() the dataframe which gives a boolean dataframe with the same dataframe structure:

df.stack().duplicated(keep=False).unstack()
       A      B
0   True   True
1  False  False
2   True   True
3   True   True
4  False  False
5   True  False

Once we have this we assign the background color to values if True else no color using np.where

anky
  • 74,114
  • 11
  • 41
  • 70
  • Hello can you please tell me what is x? – Twinkle Lahariya Jan 11 '20 at 12:37
  • @anky_91 but what does X representing in this case – Twinkle Lahariya Jan 11 '20 at 12:46
  • @TwinkleLahariya it is any dataframe that you want to apply this style. Basically it is a function hence i named it x , you can apply the same function to `df` , `df1`.. etc whichever dataframe you want to apply this function , so the line `x.stack().duplicated(keep=False).unstack()` would return same as shown for `df.stack().duplicated(keep=False).unstack()` since it is applied on `df` – anky Jan 11 '20 at 12:49
  • @anky_91 and what if I have multiple columns in my xlsx file and I have to apply this to only two( say column E & F) of that file? – Twinkle Lahariya Jan 11 '20 at 12:53
  • @TwinkleLahariya then keep the function same just apply to a subset like `df.style.apply(color_dupes,axis=None,subset=['E','F'])` – anky Jan 11 '20 at 12:55
  • @anky_91 The solution you gave didn't work. Can we discuss this in chat? – Twinkle Lahariya Jan 11 '20 at 13:21
  • Hi the code was really helpful but do you know any ways where it would ignore NaN values and if certain specific columns from highlighting? For example if dates are present I dont want it to be highlighted. – Aps Sep 01 '21 at 18:54