1

The problem is I have a dataset as follows from a csv

viewer id      movie id       movie Name      rating
1              2              XXX             4
1              3              DDD             3
1              4              YYY             5
2              2              XXX             4
3              2              XXX             Not Available

I'm trying to find all movies that have at least 2 ratings AND that have an average rating of 4. Under the rating column there are also 'Not available' values. With Pandas, I'd like to show that the movie here which would be found with a query is XXX

I tried using groupby but am not able to also include the average rating. I converted the Not available rating to nan to get rid of the 'object' issue stopping me from calculating a mean.

  • Welcome to Stack Overflow! Check out the [tour]. Converting the "Not Available" to NaN is a step in the right direction, so much so that there's no point in showing the original df. (But FWIW, you can use the `na_values` argument to `read_csv` to simplify that in the future.) Please show us the df after, and please show the code you tried that didn't work, and what went wrong. For more details, see [mre]. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). And for more tips, like how to write a good title, check out [ask]. – wjandrea Aug 08 '23 at 22:37

2 Answers2

0

You can use groupby.filter, where you check how many valid ratings the movie has and do the average (and check it it's equal to 4):

x = df.groupby("movie id").filter(
    lambda x: (valid_ratings := x["rating"].ne("Not Available")).sum() >= 2
    and x.loc[valid_ratings, "rating"].astype(int).mean() == 4
)

print(x["movie Name"].unique())

Prints:

['XXX']
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
0

Seems like you were on the right track. Indeed, first we need to normalise the data by converting non-numeric values to NaN.

Then we can group the data using .groupby(...) as you mentioned and use .agg(...) to apply our aggregate functions to rating column:

  • count for getting count of movies, and
  • mean to calculate the average rating

Finally, we filter and print the results.

Here's the complete snippet:

import pandas as pd

data = {
    'viewer id': [1, 1, 1, 2, 3],
    'movie id': [2, 3, 4, 2, 2],
    'movie Name': ['XXX', 'DDD', 'YYY', 'XXX', 'XXX'],
    'rating': [4, 3, 5, 4, 'Not Available']
}
df = pd.DataFrame(data)

# Convert 'Not available' to NaN and ensure ratings are numeric
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Group & aggregate
grouped = df.groupby('movie Name').agg(
    num_ratings=('rating', 'count'),
    avg_rating=('rating', 'mean'))

filtered = grouped[(grouped['num_ratings'] >= 2) & (grouped['avg_rating'] == 4)]
print(filtered)

Output:

              num_ratings  avg_rating
movie Name                         
XXX                  2         4.0
wjandrea
  • 28,235
  • 9
  • 60
  • 81
miloserdow
  • 1,051
  • 1
  • 7
  • 27
  • FWIW, it's a little simpler if you do: `grouped['num_ratings'].ge(2) & grouped['avg_rating'].eq(4)` – wjandrea Aug 08 '23 at 22:51