python data science Find movies with highest female ratings

Question

I am working on an assignment for my Data Science class. I just need help getting started, as I'm having trouble understanding how to use pandas to group rows and select DISTINCT values.

I need to find the movies with the HIGHEST RATINGS by FEMALES. My code returns movies with rating = 5 and gender = 'F', but it repeats the same movie over and over, since more than one user rated each movie. I'm not sure how to show just the movie, its count of 5-star ratings, and gender = F. Below is my code:

import pandas as pd

# load the three data files
m = pd.read_csv('movies.csv')
u = pd.read_csv('users.csv')
r = pd.read_csv('ratings.csv')

# pd.merge joins on the columns the two frames have in common
ur = pd.merge(u, r)
data = pd.merge(m, ur)

df = pd.DataFrame(data)

# keeps one row per (user, movie) rating, so each movie appears many times
top10 = df.loc[(df.gender == 'F') & (df.rating == 5)]
print(top10)

The data files can be downloaded here.

I just need some help getting started. There's a lot more to the homework, but once I figure this out I can do the rest. I just need a jump-start. Thank you very much.

mv_id  title                               genres                        rating  user_id  gender
1      Toy Story (1995)                    Animation|Children's|Comedy   5       1        F
2      Jumanji (1995)                      Adventure|Children's|Fantasy  5       2        F
3      Grumpier Old Men (1995)             Comedy|Romance                5       3        F
4      Waiting to Exhale (1995)            Comedy|Drama                  5       4        F
5      Father of the Bride Part II (1995)  Comedy                        5       5        F

Don't provide external links rather provide some sample data in text. — Sociopath, Sep 24 '18 at 06:13
Possible duplicate of [Drop all duplicate rows in Python Pandas](https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas) — girlvsdata, Sep 24 '18 at 06:18
Can you provide how your df, top10 and desired top 10 looks like ? — asimo, Sep 24 '18 at 06:18

score 0 · Accepted Answer · answered Sep 24 '18 at 09:36

I would try to do the filtering operation on as little data as possible. To select 5-star ratings of female users, there's no need for the movie metadata (movies.csv). It can be done on the ur data, which is easier than on the df.

# filter the data in `ur`
f_5s_ratings = ur.loc[(ur.gender == 'F')&(ur.rating == 5)]

# count rows per `movie_id`
abs_num_f_5s_ratings = f_5s_ratings.groupby("movie_id").size()

In abs_num_f_5s_ratings you now have a Series, indexed by movie_id, that counts the total number of 5-star ratings by female users for each movie:

movie_id
1       253
2        15
3        14
...

If you join that data with m on the key movie_id, adding the count as a new column (I'll leave it as an exercise to you), you can then sort by this value to get your top 10 movies ranked by the absolute number of 5-star ratings by females.
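For reference, here is a minimal sketch of one way that join-and-sort step might look. It runs on small made-up frames rather than the real files, and it assumes the key column is called movie_id in both m and ur (the sample in the question shows mv_id, so adjust the name to match your data). The column name f_5s_count and the toy ratings are my own inventions for illustration.

```python
import pandas as pd

# Toy stand-ins for movies.csv and the merged users/ratings frame `ur`.
# The values and the column names are assumptions, not the real data.
m = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})
ur = pd.DataFrame({
    "user_id":  [1, 2, 3, 4, 5, 6],
    "gender":   ["F", "F", "M", "F", "F", "F"],
    "movie_id": [1, 1, 1, 2, 3, 1],
    "rating":   [5, 5, 5, 5, 4, 5],
})

# Same filter and count as above, turned into a two-column DataFrame.
f_5s_ratings = ur.loc[(ur.gender == "F") & (ur.rating == 5)]
counts = (
    f_5s_ratings.groupby("movie_id")
    .size()
    .reset_index(name="f_5s_count")  # hypothetical column name
)

# Left join keeps every movie; movies without such ratings get a count of 0.
ranked = (
    m.merge(counts, on="movie_id", how="left")
    .fillna({"f_5s_count": 0})
    .astype({"f_5s_count": int})
    .sort_values("f_5s_count", ascending=False)
)

top10 = ranked.head(10)
print(top10)
```

A left join is used so that movies with no 5-star ratings from female users still show up with a count of 0; if you only care about movies that have at least one, an inner join (the default for merge) would do and the fillna step could be dropped.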
