Count items greater than a value in pandas groupby

Question

I have the Yelp dataset and I want to count all reviews which have greater than 3 stars. I get the count of reviews by doing this:

reviews.groupby('business_id')['stars'].count()

Now I want to get the count of reviews which had more than 3 stars, so I tried this by taking inspiration from here:

reviews.groupby('business_id')['stars'].agg({'greater':lambda val: (val > 3).count()})

But this just gives me the count of all stars like before. I am not sure if this is the right way to do it? What am I doing incorrectly here. Does the lambda expression not go through each value of the stars column?

EDIT: Okay I feel stupid. I should have used the sum function instead of count to get the value of elements greater than 3, like this:

reviews.groupby('business_id')['stars'].agg({'greater':lambda val: (val > 3).sum()})

score 25 · Accepted Answer · answered Nov 20 '16 at 23:55

25

You can try to do :

reviews[reviews['stars'] > 3].groupby('business_id')['stars'].count()

answered Nov 20 '16 at 23:55

Mohamed AL ANI

2,012
1
12
29

Cool! That works. Thanks so much. I am a bit new to pandas, so do you mind telling me a bit about how your grouping method works as compared to the way I have grouped data. Is either of them more efficient? – rookie Nov 21 '16 at 00:16
It's better I think to use "mask" syntax when filtering your dataset than to go through all your data. – Mohamed AL ANI Nov 21 '16 at 00:27
4

http://www.webpages.uidaho.edu/~stevel/504/Pandas%20DataFrame%20Notes.pdf This is a really good cheat sheet you can use when dealing with pandas – Mohamed AL ANI Nov 21 '16 at 00:29
Thanks, really appreciate your help! – rookie Nov 21 '16 at 00:35
3

however, this will fail to tell you which business_id's have a count of zero – haitham Feb 15 '18 at 00:45

Esben Eickhardt · Answer 2 · 2022-02-03T08:57:02.720

5

As I also wanted to rename the column and to run multiple functions on the same column, I came up with the following solution:

# Counting both over and under
reviews.groupby('business_id')\
       .agg(over=pandas.NamedAgg(column='stars', aggfunc=lambda x: (x > 3).sum()), 
            under=pandas.NamedAgg(column='stars', aggfunc=lambda x: (x < 3).sum()))\
       .reset_index()

The pandas.NamedAgg allows you to create multiple new columns now that the functionality was removed in newer versions of pandas.

edited Feb 03 '22 at 08:57

answered Feb 24 '20 at 15:01

Esben Eickhardt

3,183
2
35
56

1

Short format now available: `reviews.groupby('business_id').agg(over=('stars', lambda x: (x > 3).sum()), under=('stars', lambda x: (x < 3).sum())).reset_index()` – Colin Anthony Sep 21 '20 at 07:21

Zacks Shen · Answer 3 · 2021-10-05T19:13:15.863

Correct Query

Python

# Pass a df and apply the lambda function to column stars
reviews.groupby('business_id').apply(lambda df: sum(df.stars > 3))

Code Explanation

lambda df: sum(df.stars > 3)

This lambda function requires a pandas DataFrame instance then filter if df.stars > 3. If then, the lambda function gets a True else False. Finally, sum the True records. Since I applied groupby before performing this lambda function, it will sum if df.stars > 3 for each group.

Equivalent SQL Statement

SELECT
    business_id,
    SUM(IF(starts > 3, 1, 0)) AS starts_>3
FROM reviews
GROUP BY business_id;

OR

SELECT
    business_id,
    COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
GROUP BY business_id;

Wrong Query

Python

reviews[reviews.stars > 3].groupby('business_id').size()

OR

reviews[reviews.stars > 3].groupby('business_id')['stars'].count()

Equivalent SQL Statement

SELECT
    business_id,
    SUM(IF(starts > 3, 1, 0)) AS starts_>3
WHERE starts > 3
FROM reviews
GROUP BY business_id;

OR

SELECT
    business_id,
    COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
WHERE starts > 3
GROUP BY business_id;

Why Wrong?

As you can see, the wrong Python query used reviews[reviews.stars > 3] to filter the stars that are greater than 3 before groupby('business_id), which is equal to applying WHERE stars > 3 before GROUP BY business_id in SQL.

Therefore, assume you have a business_id with only records stars <= 3. The wrong query will IGNORE this business_id. And you WON'T count them.

Any improvement?

Yes. You can improve the python query to rename the query result. Pandas is not as convenient as PySpark, but we can still name the column name.

# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': df.stars > 3})
reviews.groupby('business_id').apply(lambda_func)

Evaluation

Generate Sample Dataset

You can use the following code for evaluation:

import pandas as pd
import random

# define business_ids
business_ids = range(1, 4)

# define stars
stars = range(1, 6)

# Generate a sample table reviews
reviews = pd.DataFrame(columns = ['review_id', 'business_id', 'stars'])
for business_id in business_ids:
    for i in range(random.randrange(1, 5)): # Assume each business_id has 1~4 reviews
        review = [len(reviews)+1, business_id, random.choice(stars)]
        reviews.loc[len(reviews)] = review
reviews

My sample dataset:

	review_id	business_id	stars
0	1	1	4
1	2	1	5
2	3	1	4
3	4	1	1
4	5	2	3
5	6	2	5
6	7	2	2
7	8	2	3
8	9	3	3
9	10	3	1
10	11	3	3

Correct Python Query

"""
business_id, stars_>3
1, 3
2, 1
3, 0
"""
# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': sum(df.stars > 3)})
reviews.groupby('business_id').apply(lambda_func)

	stars_>3
business_id
1	3
2	1
3	0

Wrong Python Query

reviews[reviews.stars > 3].groupby('business_id')['stars'].count()

Output:

business_id
1    3
2    1

Please add some explanation how your code works and why your answer may be better from the others. Thanks! — sanastasiadis, Oct 04 '21 at 07:39

score 1 · Answer 4 · answered May 22 '20 at 17:13

A bit late, but my solution is:

reviews.groupby('business_id').stars.apply(lambda x: len(x[x>3]) )

I came across this thread in search of finding "what is the fraction of values above X in a given GroupBy". Here is the solution if anyone is interested:

reviews.groupby('business_id').stars.apply(lambda x: len(x[x>3])/len(x) )

score 1 · Answer 5 · answered May 28 '20 at 13:20

1

I quite like using method chaining with Pandas as I find it easier to read. I haven't tried it but I think this should also work

reviews.query("stars > 3").groupby("business_id").size()

answered May 28 '20 at 13:20

Jonny Brooks

3,169
3
19
22

score 0 · Answer 6 · answered Feb 03 '22 at 09:01

0

For perfromance first create mask and then aggregate sum:

(reviews['stars'] > 3).groupby(reviews['business_id']).sum().reset_index()

answered Feb 03 '22 at 09:01

jezrael

822,522
95
1,334
1,252

Count items greater than a value in pandas groupby

6 Answers6

Correct Query

Python

Code Explanation

Equivalent SQL Statement

Wrong Query

Python

Equivalent SQL Statement

Why Wrong?

Any improvement?

Evaluation

Generate Sample Dataset

Correct Python Query

Wrong Python Query

Linked