24

I have the Yelp dataset and I want to count all reviews which have greater than 3 stars. I get the count of reviews by doing this:

reviews.groupby('business_id')['stars'].count()

Now I want to get the count of reviews which had more than 3 stars, so I tried this by taking inspiration from here:

reviews.groupby('business_id')['stars'].agg({'greater':lambda val: (val > 3).count()})

But this just gives me the count of all stars like before. I am not sure if this is the right way to do it? What am I doing incorrectly here. Does the lambda expression not go through each value of the stars column?

EDIT: Okay I feel stupid. I should have used the sum function instead of count to get the value of elements greater than 3, like this:

reviews.groupby('business_id')['stars'].agg({'greater':lambda val: (val > 3).sum()})
Community
  • 1
  • 1
rookie
  • 1,168
  • 3
  • 14
  • 25

6 Answers6

25

You can try to do :

reviews[reviews['stars'] > 3].groupby('business_id')['stars'].count()
Mohamed AL ANI
  • 2,012
  • 1
  • 12
  • 29
  • Cool! That works. Thanks so much. I am a bit new to pandas, so do you mind telling me a bit about how your grouping method works as compared to the way I have grouped data. Is either of them more efficient? – rookie Nov 21 '16 at 00:16
  • It's better I think to use "mask" syntax when filtering your dataset than to go through all your data. – Mohamed AL ANI Nov 21 '16 at 00:27
  • 4
    http://www.webpages.uidaho.edu/~stevel/504/Pandas%20DataFrame%20Notes.pdf This is a really good cheat sheet you can use when dealing with pandas – Mohamed AL ANI Nov 21 '16 at 00:29
  • Thanks, really appreciate your help! – rookie Nov 21 '16 at 00:35
  • 3
    however, this will fail to tell you which business_id's have a count of zero – haitham Feb 15 '18 at 00:45
5

As I also wanted to rename the column and to run multiple functions on the same column, I came up with the following solution:

# Counting both over and under
reviews.groupby('business_id')\
       .agg(over=pandas.NamedAgg(column='stars', aggfunc=lambda x: (x > 3).sum()), 
            under=pandas.NamedAgg(column='stars', aggfunc=lambda x: (x < 3).sum()))\
       .reset_index()

The pandas.NamedAgg allows you to create multiple new columns now that the functionality was removed in newer versions of pandas.

Esben Eickhardt
  • 3,183
  • 2
  • 35
  • 56
  • 1
    Short format now available: `reviews.groupby('business_id').agg(over=('stars', lambda x: (x > 3).sum()), under=('stars', lambda x: (x < 3).sum())).reset_index()` – Colin Anthony Sep 21 '20 at 07:21
3

Correct Query

Python

# Pass a df and apply the lambda function to column stars
reviews.groupby('business_id').apply(lambda df: sum(df.stars > 3))

Code Explanation

lambda df: sum(df.stars > 3)

This lambda function requires a pandas DataFrame instance then filter if df.stars > 3. If then, the lambda function gets a True else False. Finally, sum the True records. Since I applied groupby before performing this lambda function, it will sum if df.stars > 3 for each group.


Equivalent SQL Statement

SELECT
    business_id,
    SUM(IF(starts > 3, 1, 0)) AS starts_>3
FROM reviews
GROUP BY business_id;

OR

SELECT
    business_id,
    COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
GROUP BY business_id;

Wrong Query

Python

reviews[reviews.stars > 3].groupby('business_id').size()

OR

reviews[reviews.stars > 3].groupby('business_id')['stars'].count()

Equivalent SQL Statement

SELECT
    business_id,
    SUM(IF(starts > 3, 1, 0)) AS starts_>3
WHERE starts > 3
FROM reviews
GROUP BY business_id;

OR

SELECT
    business_id,
    COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
WHERE starts > 3
GROUP BY business_id;

Why Wrong?

As you can see, the wrong Python query used reviews[reviews.stars > 3] to filter the stars that are greater than 3 before groupby('business_id), which is equal to applying WHERE stars > 3 before GROUP BY business_id in SQL.

Therefore, assume you have a business_id with only records stars <= 3. The wrong query will IGNORE this business_id. And you WON'T count them.


Any improvement?

Yes. You can improve the python query to rename the query result. Pandas is not as convenient as PySpark, but we can still name the column name.

# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': df.stars > 3})
reviews.groupby('business_id').apply(lambda_func)

Evaluation

Generate Sample Dataset

You can use the following code for evaluation:

import pandas as pd
import random

# define business_ids
business_ids = range(1, 4)

# define stars
stars = range(1, 6)

# Generate a sample table reviews
reviews = pd.DataFrame(columns = ['review_id', 'business_id', 'stars'])
for business_id in business_ids:
    for i in range(random.randrange(1, 5)): # Assume each business_id has 1~4 reviews
        review = [len(reviews)+1, business_id, random.choice(stars)]
        reviews.loc[len(reviews)] = review
reviews

My sample dataset:

review_id business_id stars
0 1 1 4
1 2 1 5
2 3 1 4
3 4 1 1
4 5 2 3
5 6 2 5
6 7 2 2
7 8 2 3
8 9 3 3
9 10 3 1
10 11 3 3

Correct Python Query

"""
business_id, stars_>3
1, 3
2, 1
3, 0
"""
# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': sum(df.stars > 3)})
reviews.groupby('business_id').apply(lambda_func)
stars_>3
business_id
1 3
2 1
3 0

Wrong Python Query

reviews[reviews.stars > 3].groupby('business_id')['stars'].count()

Output:

business_id
1    3
2    1
Zacks Shen
  • 99
  • 4
1

A bit late, but my solution is:

reviews.groupby('business_id').stars.apply(lambda x: len(x[x>3]) )

I came across this thread in search of finding "what is the fraction of values above X in a given GroupBy". Here is the solution if anyone is interested:

reviews.groupby('business_id').stars.apply(lambda x: len(x[x>3])/len(x) )
Ivan P.
  • 480
  • 2
  • 5
  • 10
1

I quite like using method chaining with Pandas as I find it easier to read. I haven't tried it but I think this should also work

reviews.query("stars > 3").groupby("business_id").size()
Jonny Brooks
  • 3,169
  • 3
  • 19
  • 22
0

For perfromance first create mask and then aggregate sum:

(reviews['stars'] > 3).groupby(reviews['business_id']).sum().reset_index()
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252