Create a dictionary of dataframes for each unique user
- Use the regular expression r'\((\d+)\)' with .str.extract to capture the digits, \d+, from between the parentheses, and assign the result to movies['Year']
- Instead of repeatedly filtering the dataframe for different users, this will add each user to a dictionary as the key, and the value will be the dataframe filtered for that user.
- There are 162541 unique userId values, so instead of looping over df.userId.unique(), use a list of the specific userId values you're interested in.
- Also see the answer to How to rotate seaborn barplot x-axis tick labels, which is related to this MovieLens dataset.
# question 1: create a column for the year extracted from the title
# extracts the digits between parentheses
# does not change the title column
# the raw string (r'...') keeps the backslashes in the pattern literal
df['Year'] = df.title.str.extract(r'\((\d+)\)')
# create a dict of dataframes, one per user
userid_movies = dict()
for user in [10, 15, 191]:  # df.userId.unique() has 162541 unique users
    data = df[df.userId == user]
    userid_movies[user] = data
# get data for user 191; assumes ids are int. if not, use '191'
userid_movies[191]  # in Jupyter, the last expression is displayed, so print is not needed
Example
import pandas as pd
# load movies
movies = pd.read_csv('data/ml-25m/movies.csv')
# extract year
movies['Year'] = movies.title.str.extract(r'\((\d+)\)')
# display head
   movieId                               title                                       genres  Year
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy  1995
2        3             Grumpier Old Men (1995)                               Comedy|Romance  1995
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance  1995
4        5  Father of the Bride Part II (1995)                                       Comedy  1995
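Note that str.extract returns strings, so Year is text rather than a number, and a title with no parenthesized digits gives NaN. If a numeric year is wanted, one possible approach is sketched below. The small frame is made up to stand in for movies.csv, and the stricter pattern (exactly four digits at the end of the title) is an assumption about the title format, not something the dataset guarantees.

```python
import pandas as pd

# toy titles standing in for movies.csv (invented for illustration)
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled Movie'],
})

# expand=False returns a Series instead of a one-column DataFrame
# assumed pattern: a 4-digit group in parentheses at the end of the title
year = movies['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)

# convert the extracted strings to nullable integers; unmatched titles become <NA>
movies['Year'] = pd.to_numeric(year).astype('Int64')
print(movies)
```

The nullable Int64 dtype is used here so that missing years do not force the whole column to float.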
# load ratings
ratings = pd.read_csv('data/ml-25m/ratings.csv')
# merge on movieId
df = pd.merge(movies, ratings, on='movieId').reset_index(drop=True)
# display df
   movieId             title                                       genres  Year  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       2     3.5  1141415820
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       3     4.0  1439472215
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       4     3.0  1573944252
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       5     4.0   858625949
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       8     4.0   890492517
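The merge joins every rating before any user is selected. If only a handful of users matter, filtering ratings first keeps the merged frame small. A minimal sketch of that idea, using toy frames in place of the real files and a hypothetical users_of_interest list:

```python
import pandas as pd

# toy frames standing in for movies.csv and ratings.csv (values invented)
movies = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})
ratings = pd.DataFrame({
    'userId': [10, 10, 15, 42],
    'movieId': [1, 2, 1, 2],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

users_of_interest = [10, 15]  # hypothetical ids, not taken from the dataset

# keep only the rows for the selected users, then merge the smaller frame
subset = ratings[ratings['userId'].isin(users_of_interest)]
df_small = movies.merge(subset, on='movieId')
print(df_small)
```

Whether this is worthwhile depends on how many users are needed; for a large share of the users, merging everything once may be just as reasonable.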
# dict of dataframes
# there are 162541 unique userId values, so instead of using df.userId.unique()
# use a list of the specific Id values you're interested in
userid_movies = dict()
for user in [10, 15, 191]:
    data = df[df.userId == user].reset_index(drop=True)
    userid_movies[user] = data
# display(userid_movies[191].head())
   movieId                                  title                                              genres  Year  userId  rating   timestamp
0    68135                        17 Again (2009)                                        Comedy|Drama  2009     191     3.0  1473704208
1    68791            Terminator Salvation (2009)                    Action|Adventure|Sci-Fi|Thriller  2009     191     5.0  1473704167
2    68954                              Up (2009)                  Adventure|Animation|Children|Drama  2009     191     4.0  1473703994
3    69406                   Proposal, The (2009)                                      Comedy|Romance  2009     191     4.0  1473704198
4    69644  Ice Age: Dawn of the Dinosaurs (2009)  Action|Adventure|Animation|Children|Comedy|Romance  2009     191     1.5  1473704242
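As an alternative to the loop, the same kind of dictionary can probably be built in a single pass with isin and groupby, which avoids re-scanning the full dataframe once per user. This is a sketch on made-up data, not something timed against the loop, and users_of_interest is a name introduced here for illustration.

```python
import pandas as pd

# toy merged frame standing in for df (values invented)
df = pd.DataFrame({
    'userId': [10, 15, 10, 191, 42],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Up (2009)',
              'Up (2009)', 'Up (2009)'],
    'rating': [4.0, 3.0, 5.0, 4.0, 2.5],
})

users_of_interest = [10, 15, 191]  # hypothetical list of ids

# filter once, then split the filtered rows by userId
selected = df[df['userId'].isin(users_of_interest)]
userid_movies = {
    user: group.reset_index(drop=True)
    for user, group in selected.groupby('userId')
}
print(userid_movies[10])
```

One difference from the loop: an id in the list that never appears in the data gets no key here, whereas the loop stores an empty dataframe for it.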