Correct Query
Python
# Pass a df and apply the lambda function to column stars
reviews.groupby('business_id').apply(lambda df: sum(df.stars > 3))
Code Explanation
lambda df: sum(df.stars > 3)
This lambda function requires a pandas DataFrame instance then filter if df.stars > 3
. If then, the lambda function gets a True
else False
. Finally, sum
the True
records.
Since I applied groupby
before performing this lambda function, it will sum
if df.stars > 3
for each group.
Equivalent SQL Statement
SELECT
business_id,
SUM(IF(starts > 3, 1, 0)) AS starts_>3
FROM reviews
GROUP BY business_id;
OR
SELECT
business_id,
COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
GROUP BY business_id;
Wrong Query
Python
reviews[reviews.stars > 3].groupby('business_id').size()
OR
reviews[reviews.stars > 3].groupby('business_id')['stars'].count()
Equivalent SQL Statement
SELECT
business_id,
SUM(IF(starts > 3, 1, 0)) AS starts_>3
WHERE starts > 3
FROM reviews
GROUP BY business_id;
OR
SELECT
business_id,
COUNT(IF(starts > 3, 1, NULL)) AS starts_>3
FROM reviews
WHERE starts > 3
GROUP BY business_id;
Why Wrong?
As you can see, the wrong Python query used reviews[reviews.stars > 3]
to filter the stars that are greater than 3 before groupby('business_id)
, which is equal to applying WHERE stars > 3
before GROUP BY business_id
in SQL.
Therefore, assume you have a business_id
with only records stars <= 3
. The wrong query will IGNORE this business_id
. And you WON'T count them.
Any improvement?
Yes. You can improve the python query to rename the query result. Pandas is not as convenient as PySpark, but we can still name the column name.
# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': df.stars > 3})
reviews.groupby('business_id').apply(lambda_func)
Evaluation
Generate Sample Dataset
You can use the following code for evaluation:
import pandas as pd
import random
# define business_ids
business_ids = range(1, 4)
# define stars
stars = range(1, 6)
# Generate a sample table reviews
reviews = pd.DataFrame(columns = ['review_id', 'business_id', 'stars'])
for business_id in business_ids:
for i in range(random.randrange(1, 5)): # Assume each business_id has 1~4 reviews
review = [len(reviews)+1, business_id, random.choice(stars)]
reviews.loc[len(reviews)] = review
reviews
My sample dataset:
|
review_id |
business_id |
stars |
0 |
1 |
1 |
4 |
1 |
2 |
1 |
5 |
2 |
3 |
1 |
4 |
3 |
4 |
1 |
1 |
4 |
5 |
2 |
3 |
5 |
6 |
2 |
5 |
6 |
7 |
2 |
2 |
7 |
8 |
2 |
3 |
8 |
9 |
3 |
3 |
9 |
10 |
3 |
1 |
10 |
11 |
3 |
3 |
Correct Python Query
"""
business_id, stars_>3
1, 3
2, 1
3, 0
"""
# Pass a df and apply the lambda function to column stars
lambda_func = lambda df: pd.Series({'stars_>3': sum(df.stars > 3)})
reviews.groupby('business_id').apply(lambda_func)
|
stars_>3 |
business_id |
|
1 |
3 |
2 |
1 |
3 |
0 |
Wrong Python Query
reviews[reviews.stars > 3].groupby('business_id')['stars'].count()
Output:
business_id
1 3
2 1