2

I have a pandas DataFrame in which some rows hold the same value across several columns. I want to create a boolean mask that is True for a row when all of these columns have the same value in that row, and I want to pass the list of columns to check dynamically. Example:

A | B | C | Mask
1 | 1 | 1 | True
2 | 2 | 3 | False
4 | 4 | 4 | True

The mask should be returned by a same_values function that takes the DataFrame and a list of columns. For example:

same_values(data, ['A', 'B', 'C'])

With the columns hard-coded, I can do it like this:

data[(data['A']==data['B'])&(data['A']==data['C'])]

I could loop over the passed columns and compare each one with the first, but this seems inefficient. Is there a better solution?
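
For reference, the loop-based idea mentioned in the question might look like the minimal sketch below. The name same_values_loop and its body are assumptions made for illustration, not code from the original post; it simply generalizes the hard-coded `&` chain to any list of columns.

```python
from functools import reduce
import operator

import pandas as pd


def same_values_loop(data, columns):
    """Return a boolean Series that is True where all `columns` agree in a row."""
    first = data[columns[0]]
    # One element-wise comparison per remaining column.
    comparisons = (first == data[col] for col in columns[1:])
    # Start from an all-True mask so a single-column list yields all True.
    all_true = pd.Series(True, index=data.index)
    return reduce(operator.and_, comparisons, all_true)


# Sample data matching the table in the question.
data = pd.DataFrame({'A': [1, 2, 4], 'B': [1, 2, 4], 'C': [1, 3, 4]})
mask = same_values_loop(data, ['A', 'B', 'C'])
print(mask.tolist())  # [True, False, True]
```

Each comparison is itself vectorized, so the Python-level loop runs once per column rather than once per row.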

Jan van der Vegt

4 Answers

1

You can compare every column of the DataFrame with the first column using eq, then reduce each row with all:

print (df.eq(df.iloc[:,0], axis=0))
      A     B      C
0  True  True   True
1  True  True  False
2  True  True   True

print (df.eq(df.iloc[:,0], axis=0).all(axis=1))
0     True
1    False
2     True
dtype: bool

If you need to compare only some of the columns, select the subset first and compare against its first column:

L = ['A','B','C']
print (df[L].eq(df[L[0]], axis=0).all(axis=1))
0     True
1    False
2     True
dtype: bool
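
Wrapped into the same_values function the question asks for, the approach above could be sketched as follows. The function body is an assumption built from the eq and all calls shown in this answer, and the sample frame mirrors the table in the question.

```python
import pandas as pd


def same_values(data, columns):
    """Return a boolean mask: True where every column in `columns` matches."""
    subset = data[columns]
    # Compare each selected column with the first selected column, row by row,
    # then require every comparison in the row to be True.
    return subset.eq(subset.iloc[:, 0], axis=0).all(axis=1)


df = pd.DataFrame({'A': [1, 2, 4], 'B': [1, 2, 4], 'C': [1, 3, 4]})
print(same_values(df, ['A', 'B', 'C']).tolist())  # [True, False, True]
print(same_values(df, ['A', 'B']).tolist())       # [True, True, True]
```

The returned Series can be used directly for filtering, for example df[same_values(df, ['A', 'B', 'C'])].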
jezrael
1

After I discussed this with a colleague, he pointed me to this post:

Get rows that have the same value across its columns in pandas

I timed the methods mentioned here and in the linked post. Here are the results:

%timeit test1 = test[test.apply(pd.Series.nunique, axis=1)==1]
1.23 s per loop

%timeit test2 = test[test.eq(test['A'], axis='index').all(1)]
3.47 ms per loop

%timeit test3 = test[test.apply(lambda x: x.duplicated(keep=False).all(), axis=1)]
2.3 s per loop

%timeit test4 = test[test.apply(lambda x: (x == x.iloc[0]).all(), axis=1)]
4.5 s per loop
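
The test frame used for these timings is not shown, so the sketch below uses a hypothetical random frame as a stand-in. It assumes only the standard timeit module, checks that the vectorized and row-wise variants agree, and then times both; the absolute numbers will differ from those above depending on data size and machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for `test`: 2,000 rows of 0/1 values in three columns.
rng = np.random.default_rng(0)
test = pd.DataFrame(rng.integers(0, 2, size=(2000, 3)), columns=['A', 'B', 'C'])


def vectorized():
    # Column-wise comparison against 'A', reduced across each row.
    return test.eq(test['A'], axis='index').all(axis=1)


def row_wise():
    # Calls nunique once per row through apply.
    return test.apply(pd.Series.nunique, axis=1) == 1


# Both variants should produce the same mask on data without missing values.
assert vectorized().equals(row_wise())

for fn in (vectorized, row_wise):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```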
Jan van der Vegt
1

You can try this:

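import pandas as pd

data = pd.DataFrame({'a': [1, 2, 4], 'b': [1, 2, 4], 'c': [1, 3, 4]})
# A row has identical values when the set of its values has exactly one element.
data.apply(lambda x: len(set(x)) == 1, axis=1)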
Shravan
0

How bad is this:

list_a = [1, 2, 4]
list_b = [1, 2, 4]
list_c = [1, 3, 4]

# Concatenate the three values into one number; for single-digit values it is
# a multiple of 111 exactly when all three digits match.
longshot = [int(f"{a}{b}{c}") % 111 == 0 for a, b, c in zip(list_a, list_b, list_c)]
print(longshot)  # [True, False, True]
Ma0