0

I want to find all the rows that have the same values across all columns, or more specifically, drop all the rows that have difference. I was thinking I would iterate through each column and check for the greatest majority and remove the other rows. I feel like this method is not the best way to utilize a dataframe.

I was checking previous posts and someone had a similar question but they wanted the inverse result so I will use their example:

Input:

index  A  B  C  D  E  F ....
 0     1  2  3  4  2  2
 1     1  2  3  4  2  2
 2     5  5  5  5  5  5 
 3     7  7  6  7  7  7

Desired Output:

index  A  B  C  D  E  F ....
 0     1  2  3  4  2  2
 1     1  2  3  4  2  2

There can be many columns here.

Edit: The example I provided is wrong. Sorry I was tired. I have updated the example. Now that I have typed out my question I think I am basically just asking how do I find the rows that are all the same. If there are multiple groups of identical rows I want to know those as well.

nchuang
  • 83
  • 8
  • Which post are you referring to? I found this one which is almost what you need: https://stackoverflow.com/questions/22701799/pandas-dataframe-find-rows-where-all-columns-equal – panktijk Feb 12 '19 at 22:26
  • Have you tried df[df.duplicated(keep=False)] – Umar.H Feb 12 '19 at 22:29
  • @panktijk I don't want all columns equal. I realize now that the example I used is not correct. I want all rows equal. I will edit my post. – nchuang Feb 13 '19 at 15:06

2 Answers2

2

Using nunique

yourdf=df[df.nunique(1)==1]
yourdf
       A  B  C  D  E  F
index                  
1      2  2  2  2  2  2
2      5  5  5  5  5  5

Update

df[df.duplicated(keep=False)]
Out[11]: 
       A  B  C  D  E  F
index                  
0      1  2  3  4  2  2
1      1  2  3  4  2  2
BENY
  • 317,841
  • 20
  • 164
  • 234
  • I am sorry about my OP, I meant I wanted all the rows that were the same. – nchuang Feb 13 '19 at 15:32
  • this works, but is there a way to sort it so I know which groups belong together? this only gives me Boolean if there is a duplicate – nchuang Feb 13 '19 at 15:50
  • @nchuang You can sort on all columns: `df[df.duplicated(keep=False)].sort_values(list(df.columns))` – panktijk Feb 13 '19 at 19:16
  • @ after filtered, `df.groupby(list(df))).ngroup()`, then you get the group number here , – BENY Feb 13 '19 at 19:19
  • `sort_values()` works but the `df.groupby(list(df))).ngroup()` gives every row -1? – nchuang Feb 13 '19 at 19:54
  • I tested it again on a much smaller set and it works. I guess it is my larger set, which I don't understand because I literally copied one of the rows 3x and `ngroup()` called them 1,2,3 instead of 1,1,1. – nchuang Feb 13 '19 at 21:13
  • thank you it does work. My original dataset had an empty column at the end I didn't see. Once I removed it it works. – nchuang Feb 13 '19 at 22:00
0

This is probably not a very efficient solution, but if your DataFrame is small enough, this should work:

df[df.std(axis=1)==0]

(I'm assuming all values are numerical)

Tacratis
  • 1,035
  • 1
  • 6
  • 16