22

In pandas, given a DataFrame D:

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  0  | apple  | banana | banana |
|  1  | orange | orange | orange |
|  2  | banana | apple  | orange |
|  3  | NaN    | NaN    | NaN    |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

How do I return the rows whose contents are the same across all columns, for a DataFrame with three or more columns, so that it returns this:

+-----+--------+--------+--------+   
|     |    1   |    2   |    3   |
+-----+--------+--------+--------+
|  1  | orange | orange | orange |
|  4  | apple  | apple  | apple  |
+-----+--------+--------+--------+

Note that it skips rows when all values are NaN.

If there were only two columns, I would usually do D[D[1] == D[2]], but I don't know how to generalize this to DataFrames with more than two columns.

kentwait

6 Answers

18

Similar to Andy Hayden's answer: check whether the row's minimum equals its maximum (if they are equal, all elements in the row are the same):

df[df.apply(lambda x: min(x) == max(x), axis=1)]
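
For reference, a minimal sketch of what this returns on the question's frame (rebuilt below so the snippet stands alone); the all-NaN row is excluded as well, since NaN == NaN is False:

import numpy as np
import pandas as pd

# The question's frame, rebuilt here so the snippet stands alone.
df = pd.DataFrame({1: ['apple', 'orange', 'banana', np.nan, 'apple'],
                   2: ['banana', 'orange', 'apple', np.nan, 'apple'],
                   3: ['banana', 'orange', 'orange', np.nan, 'apple']})

# A row is kept only if its smallest and largest values coincide,
# i.e. every value in the row is the same.
df[df.apply(lambda x: min(x) == max(x), axis=1)]   # rows 1 and 4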
lowtech
17

My entry:

>>> df
        0       1       2
0   apple  banana  banana
1  orange  orange  orange
2  banana   apple  orange
3     NaN     NaN     NaN
4   apple   apple   apple

[5 rows x 3 columns]
>>> df[df.apply(pd.Series.nunique, axis=1) == 1]
        0       1       2
1  orange  orange  orange
4   apple   apple   apple

[2 rows x 3 columns]

This works because calling pd.Series.nunique on the rows gives:

>>> df.apply(pd.Series.nunique, axis=1)
0    2
1    1
2    3
3    0
4    1
dtype: int64

Note: this would, however, keep rows which look like [nan, nan, apple] or [nan, apple, apple]. Usually I want that, but that might be the wrong answer for your use case.
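
If you do want to exclude rows like those as well, one option (a sketch, not part of the original answer) is to drop any row containing NaN first, as the comments below also suggest, and then apply the same filter:

clean = df.dropna()                                  # remove rows containing any NaN
clean[clean.apply(pd.Series.nunique, axis=1) == 1]   # then keep the all-equal rows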

DSM
    Regarding the note, it is possible to dropna() nan values. Then it should work just fine, shouldn't it? – Aidis Jul 31 '14 at 15:37
  • Is there a simple way to modify this in order to keep also rows such as the first one? ("apple", "banana", "banana"). I would need to do a similar thing but keeping rows that have "at least" two equal values. – Tommy Jan 18 '18 at 02:40
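
Regarding the second comment (keeping rows with at least two equal values), a rough sketch along the same lines, not part of the original answer: such rows are exactly those with fewer distinct values than columns.

# At least one value repeats when the row has fewer distinct values than columns.
# Caveat: an all-NaN row would also pass; chain .dropna() first if that matters.
df[df.apply(pd.Series.nunique, axis=1) < df.shape[1]]
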
7

I would check whether each row is equal to its first element:

In [11]: df.eq(df[1], axis='index')  # note: a plain df == df[1] would broadcast against the columns, hence eq with axis='index'
Out[11]: 
      1      2      3
0  True  False  False
1  True   True   True
2  True  False  False
3  True   True   True
4  True   True   True

[5 rows x 3 columns]

If every value in the row is True, then all elements in that row are the same:

In [12]: df.eq(df[1], axis='index').all(1)
Out[12]: 
0    False
1     True
2    False
3     True
4     True
dtype: bool

Restrict the DataFrame to just those rows, and optionally dropna to remove the all-NaN row:

In [13]: df[df.eq(df[1], axis='index').all(1)]
Out[13]: 
        1       2       3
1  orange  orange  orange
3     NaN     NaN     NaN
4   apple   apple   apple

[3 rows x 3 columns]

In [14]: df[df.eq(df[1], axis='index').all(1)].dropna()
Out[14]: 
        1       2       3
1  orange  orange  orange
4   apple   apple   apple

[2 rows x 3 columns]
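
The same idea works without hard-coding the label of the first column (a small variation, assuming you'd rather select it by position):

# Compare every column against the first column, selected by position.
df[df.eq(df.iloc[:, 0], axis='index').all(axis=1)].dropna()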
Andy Hayden
3

In newer versions of pandas, you can use nunique directly on the DataFrame:

In [815]: df[df.nunique(1).eq(1)]
Out[815]:
        0       1       2
1  orange  orange  orange
4   apple   apple   apple

Details

In [816]: df
Out[816]:
        0       1       2
0   apple  banana  banana
1  orange  orange  orange
2  banana   apple  orange
3     NaN     NaN     NaN
4   apple   apple   apple

In [817]: df.nunique(1)
Out[817]:
0    2
1    1
2    3
3    0
4    1
dtype: int64

In [818]: df.nunique(1).eq(1)
Out[818]:
0    False
1     True
2    False
3    False
4     True
dtype: bool
Zero
2

You could use set to create a list of the index locations that conform to your rule, and then use that list to slice the data frame. For example:

import pandas as pd
import numpy as np

D = {0: ['apple', 'banana', 'banana'],
     1: ['orange', 'orange', 'orange'],
     2: ['banana', 'apple', 'orange'],
     3: [np.nan, np.nan, np.nan],
     4: ['apple', 'apple', 'apple']}
DF = pd.DataFrame(D).T

Equal = [row for row in DF.index if len(set(DF.iloc[row])) == 1]

DF.iloc[Equal]

Note that this excludes the all-NaN row without you having to explicitly exclude missing values. This works because of how missing values behave in a Series: NaN does not compare equal to NaN, so the all-NaN row does not reduce to a single-element set.
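
A variant of the same set-based idea that does not rely on the index labels matching their positions and that handles NaN explicitly (just a sketch):

# Build a boolean mask row by row; dropping NaN first makes the all-NaN
# row produce an empty set, so it is excluded as well.
mask = [len(set(row.dropna())) == 1 for _, row in DF.iterrows()]
DF[mask]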

Woody Pride
1

Based on DSM's answer, you may want a method like this:

import pandas as pd

def filter_data(df):
    # Drop rows containing NaN, then keep rows whose values are all equal.
    df = df.dropna()
    df = df[df.apply(pd.Series.nunique, axis=1) == 1]
    return df
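
A quick usage sketch, assuming df is the example frame from DSM's answer (columns 0, 1 and 2):

result = filter_data(df)
# result:
#         0       1       2
# 1  orange  orange  orange
# 4   apple   apple   apple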
Community