47

I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.

For example, I have

df = [  a   b   c   d
0  'C'   'C'   'C'   'C'
1  'C'   'C'   'A'   'A'
2  'A'   'A'   'A'   'A' ]

and I want the result to be

0  True
1  False
2  True

I've tried .all, but it seems I can only check if all values are equal to one particular letter. The only other way I can think of doing it is taking the unique values of each row and seeing if that count equals 1? Thanks in advance.
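Something like this sketch of the unique-count idea, perhaps (I'm not sure it is the idiomatic way):

# sketch of the per-row unique-count idea
df.apply(lambda row: row.nunique() == 1, axis=1)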

A.L
Lisa L

5 Answers

52

I think the cleanest way is to check all columns against the first column using eq:

In [11]: df
Out[11]: 
   a  b  c  d
0  C  C  C  C
1  C  C  A  A
2  A  A  A  A

In [12]: df.iloc[:, 0]
Out[12]: 
0    C
1    C
2    A
Name: a, dtype: object

In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]: 
      a     b      c      d
0  True  True   True   True
1  True  True  False  False
2  True  True   True   True

Now you can use all (if they are all equal to the first item, they are all equal):

In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]: 
0     True
1    False
2     True
dtype: bool
Andy Hayden
  • This seems like the most intuitive to me and is the way I went. Thanks. – Lisa L Mar 29 '14 at 04:29
  • 4
    Better write it as `df.eq(df.iloc[:, 0], axis=0).all(axis=1)` – Dr Fabio Gori May 24 '19 at 13:57
  • Note from [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.eq.html): "Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN)." – hashlash Mar 18 '20 at 16:22
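To illustrate that NaN caveat from the comment above, a minimal sketch (df2 here is made up for the demo):

import numpy as np
import pandas as pd

# df2 is a made-up example: row 1 is NaN in every column
df2 = pd.DataFrame({'a': [1.0, np.nan], 'b': [1.0, np.nan]})

# NaN != NaN, so the all-NaN row comes out False
print(df2.eq(df2.iloc[:, 0], axis=0).all(axis=1))
# 0     True
# 1    False
# dtype: bool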
18

Compare the array to its first column and check whether every value in each row is True. Here is the same idea in NumPy, for better performance:

a = df.values
b = (a == a[:, [0]]).all(axis=1)
print(b)
[ True False  True]

And if you need a Series:

s = pd.Series(b, index=df.index)

Comparing solutions:

data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])

#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)

#jez - numpy array
In [14]: %%timeit
    ...: a = df.values
    ...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

#jez - Series 
In [15]: %%timeit
    ...: a = df.values
    ...: b = (a == a[:, [0]]).all(axis=1)
    ...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

#Andy Hayden
In [16]: %%timeit
    ...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#Wen1
In [17]: %%timeit
    ...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#K.-Michael Aye
In [18]: %%timeit
    ...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Wen2    
In [19]: %%timeit
    ...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • Why is `DataFrame.apply()` so terribly slow in comparison to map (Aye vs. Wen1)? Both run the same lambda for every row. What causes this overhead? – normanius Oct 22 '19 at 17:15
  • Under `pandas == 1.1.5`, `numpy == 1.19.5` and `Python == 3.8`, for 5,000,000 rows and 6 cols of `int` data, the top 3 methods share almost the same efficiency and `df.eq` is 20% better. – Travis Mar 01 '21 at 13:12
  • @Travis - ya, it is possible, answer is 3 years old. But I can test again now. – jezrael Mar 01 '21 at 13:15
  • 1
    Running your example data, `jez - numpy array` is 78µs, `jez - Series` is 150µs and `df.eq` is 593µs. It may depend on the `dtypes` or `shape` of the data – Travis Mar 01 '21 at 13:19
  • @Travis - yop, agree, my data are numeric. – jezrael Mar 01 '21 at 13:19
9

nunique: new in version 0.20.0. (Based on the timing benchmark from jezrael: if performance is not important, you can use this one.)

df.nunique(axis = 1).eq(1)
Out[308]: 
0     True
1    False
2     True
dtype: bool

Or you can use map with set:

list(map(lambda x : len(set(x))==1,df.values))
BENY
  • One important thing, imo, is to remember the `dropna` parameter in the `nunique` method. If there are NaNs in some rows, `nunique` by default will not count them, and the above will report that the values are equal. We have to set `dropna=False` to avoid this, so the correct line in such a case is `df.nunique(axis=1, dropna=False).eq(1)` – Konrad May 23 '20 at 17:47
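A minimal sketch of that dropna pitfall (df3 is made up for this demo):

import numpy as np
import pandas as pd

# df3 is a made-up example: row 1 contains a NaN
df3 = pd.DataFrame({'a': [1, 1], 'b': [1, np.nan]})

# default dropna=True ignores the NaN, so row 1 looks "all equal"
print(df3.nunique(axis=1).eq(1))
# 0    True
# 1    True
# dtype: bool

# dropna=False counts the NaN as a distinct value
print(df3.nunique(axis=1, dropna=False).eq(1))
# 0     True
# 1    False
# dtype: bool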
2
df = pd.DataFrame.from_dict({'a':'C C A'.split(),
                        'b':'C C A'.split(),
                        'c':'C A A'.split(),
                        'd':'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0     True
1    False
2     True
dtype: bool

Explanation: set(x) has only one element if all elements of the row are the same. The axis=1 option applies the function over rows instead of columns.

K.-Michael Aye
2

You can use nunique(axis=1) so the results (added to a new column) can be obtained by:

df['unique'] = df.nunique(axis=1) == 1

The answer by @yo-and-ben-w uses eq(1), but I think == 1 is easier to read.

Duke