Sum of count where values are less than row

Question

I'm using Pandas to come up with new column that will search through the entire column with values [1-100] and will count the values where it's less than the current row.

See [df] example below:

[A][NewCol]
 1 0
 3 2
 2 1
 5 4
 8 5
 3 2

Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).

What would be the easiest way of doing this?

Thanks!

seems like what you are looking for is your custom version of the rank(A) - checkout https://pandas.pydata.org/pandasdocs/stable/reference/api/pandas.DataFrame.rank.html to see if you can achieve the result from the rank function as that would be the cleanest implementation here — marwan, Mar 29 '19 at 20:02

Scott Boston · Accepted Answer · 2019-03-29T20:16:13.940

7

One way to do it like this, use rank with method='min':

df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)

Output:

   A  NewCol
0  1       0
1  3       2
2  2       1
3  5       4
4  8       5
5  3       2

edited Mar 29 '19 at 20:16

answered Mar 29 '19 at 20:02

Scott Boston

147,308
15
139
187

1

I think the output need little bit adjust . – BENY Mar 29 '19 at 20:04
@Wen-Ben Thanks. – Scott Boston Mar 29 '19 at 20:05
1

@Wen-Ben LOL I knew this could be done with rank. I had to stumble through all of the rank methods. :) – Scott Boston Mar 29 '19 at 20:14
Looks better now :-) – BENY Mar 29 '19 at 20:19
1

Thank you! I was originally thinking of using rank but I dint know you could use the min method and subtract 1. Genius! – Cesar Mar 29 '19 at 20:57

BENY · Answer 2 · 2019-03-29T20:43:26.677

7

I am using numpy broadcast

s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])

#df['NewCol']=(s[:,None]>s).sum(1)

timing

df=pd.concat([df]*1000)

%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop

edited Mar 29 '19 at 20:43

answered Mar 29 '19 at 20:02

BENY

317,841
20
164
234

Is possible add timings? I guess rank should be faster, but not 100% sure in large data. – jezrael Mar 29 '19 at 20:20
@jezrael I feel like the same :-) – BENY Mar 29 '19 at 20:39
Is possible add timings? – jezrael Mar 29 '19 at 20:40
Thank you, only not sure what data re used for timeit? – jezrael Mar 29 '19 at 20:42
@jezrael added :-) – BENY Mar 29 '19 at 20:46
@Wen-Ben - Thank you, interesting pandas beat numpy :) – jezrael Mar 29 '19 at 20:48
@jezrael ummm feel like rank is using some thing like binary search to get the result , rather that o(n,n) – BENY Mar 29 '19 at 20:49

score 1 · Answer 3 · answered Mar 29 '19 at 20:02

1

Try this code

A = [Your numbers]
less_than = []
    for element in A:
        counter = 0
        for number in A:
            if number < element:
                counter += 1
        less_than.append(counter)

answered Mar 29 '19 at 20:02

SdahlSean

573
3
13

score 1 · Answer 4 · answered Mar 29 '19 at 20:05

You can do it this way:

import pandas as pd

df = pd.DataFrame({'A': [1,3,2,5,8,3]})

df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()

print(df)

   A  NewCol
0  1       0
1  3       2
2  2       1
3  5       4
4  8       5
5  3       2

score 1 · Answer 5 · answered Mar 29 '19 at 20:10

1

Another way is sort and reset index:

m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)

   new  A
0    0  1
1    1  2
2    2  3
3    3  3
4    4  5
5    5  8

answered Mar 29 '19 at 20:10

anky

74,114
11
41
70

score 1 · Answer 6 · answered Mar 29 '19 at 20:14

You didn't specify if speed or memory usage was important (or if you had a very large dataset). The "easiest" way to do it is straightfoward: calculate how many are less then i for each entry in the column and collect those into a new column:

df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]

print(df)

Result:

   A  new_col
0  1        0
1  3        2
2  2        1
3  5        4
4  8        5
5  3        2

There might be more efficient ways to do this on large datasets, such as sorting your column first.

Sum of count where values are less than row

6 Answers6