Compare values across multiple columns in pandas and count the instances in which values in the last column is higher than the others

Question

I have a DataFrame that looks like this:

What I would like to do is to compare the values in all four columns (A, B, C, and D) for every row and count the number of times in which D has the smaller value than A, B, or C for each row and add it into the 'Count' column. So for instance, 'Count' should be 1 for the second row, the third row, and 2 for the last row.

Thank you in advance!

Please don’t post images of code, data or Tracebacks. Copy and paste it as text then format it as code (select it and type `ctrl-k`). [Why should I not upload images of ... when asking a question?](https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors-when-asking-a-question). [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — wwii, Jan 11 '23 at 17:43

score 2 · Answer 1 · answered Jan 11 '23 at 17:51

2

You can use vectorize the operation using gt and sum methods along an axis:

df['Count'] = df[['A', 'B', 'C']].gt(df['D'], axis=0).sum(axis=1)
print(df)

# Output
   A  B  C  D  Count
0  1  2  3  4      0
1  4  3  2  1      3
2  2  1  4  3      1

answered Jan 11 '23 at 17:51

Corralien

109,409
8
28
52

I think your solution is better than mine due to vectorization. Thank you for posting it. I am a bit confused though - why is it that we need to specify `axis=0` with `.gt()` here? I tried it without and got some odd results that didn't make much sense. I'm still trying to wrap my head around how it's comparing in the background. – Stu Sztukowski Jan 11 '23 at 18:31
1

Because `df['A', 'B', 'C']]` return a DataFrame while `df['D']` returns a Series, so you have to indicate you want to compare along index axis. Replace `gt` by `>` and you will see you can't do any comparison. This is the same thing for `sum` in fact :-) – Corralien Jan 11 '23 at 18:37
1

I think it could be more understandable if you use numpy: `df[['A', 'B', 'C']].to_numpy() > np.vstack(df['D'].to_numpy())`. Try to remove `vstack` and now it doesn't make sense to compare. Is it clear now? (+1 for your solution) – Corralien Jan 11 '23 at 18:41

score 1 · Accepted Answer · answered Jan 11 '23 at 17:48

In the future, please do not post data as an image.

Use a lambda function and compare across all columns, then sum across the columns.

data = {'A': [1,47,4316,8511],
        'B': [4,1,3,4],
        'C': [2,7,9,1],
        'D': [32,17,1,0]
       }

df = pd.DataFrame(data)

df['Count'] = df.apply(lambda x: x['D'] < x, axis=1).sum(axis=1)

Output:

      A  B  C   D  Count
0     1  4  2  32      0
1    47  1  7  17      1
2  4316  3  9   1      3
3  8511  4  1   0      3

Compare values across multiple columns in pandas and count the instances in which values in the last column is higher than the others

2 Answers2