
This screenshot shows a sample of the merged MovieLens dataset. I have two questions:

  1. How can I extract only the movieId, title, genres and rating for user 191?
  2. How can I list only the year at the end of each movie title?

Any guidance would be highly appreciated.

Screenshot of Movielens Dataset

Trenton McKinney

3 Answers


First question: use Boolean selection.

df[df['userId'] == 191]  # userId is an int column in the MovieLens CSVs; use '191' if your ids are strings

Second question: use a regex to extract the text between parentheses.

df['Year'] = df.title.str.extract(r'\((.*?)\)')
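Some MovieLens titles contain more than one pair of parentheses, which is what the comments below touch on. As a rough sketch (the three sample titles and the column name `first_group` are chosen here for illustration only), the lazy pattern `.*?` captures the first parenthesised group, while a pattern that asks for four digits at the end of the title picks out the year:

```python
import pandas as pd

# Illustrative sample titles in MovieLens style (not loaded from the real files)
titles = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)",
        "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    ]
})

# Lazy match: captures whatever sits in the FIRST pair of parentheses
titles["first_group"] = titles["title"].str.extract(r"\((.*?)\)", expand=False)

# Four digits in parentheses at the end of the title (trailing spaces allowed)
titles["Year"] = titles["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

print(titles[["first_group", "Year"]])
```

Whether the stricter pattern is needed depends on the titles in your copy of the data; titles without a year simply get NaN in `Year`.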
wwnde
  • Would this return both the phrase and the year that appears in movieid 32 as well? – rhug123 Jun 29 '20 at 21:22
  • What do you mean? The phrase will remain in the title column, and my solution creates a new column called Year that holds the year digits. If you extract the year, it can't appear together with the phrase in the same column, can it? – wwnde Jun 29 '20 at 21:25
  • Oh, so your regex will only return a value if the parentheses contain only digits? I am not too familiar with regex, but this solution is good! – rhug123 Jun 29 '20 at 21:28
  • The regular expression case works perfectly, but the other solution doesn't produce the expected output. Thanks a lot for the quick response. – Adeolu Temitope Olofintuyi Jun 29 '20 at 23:31
  • My guess is they're int, not str. – Trenton McKinney Jun 29 '20 at 23:56
  • Is there a reason why you used `.*?` over something like `\d+`? (curious) – Moondra Jun 30 '20 at 06:22

For the first part of your question, you can filter the dataframe.

user191 = df.loc[df['userId']==191]

For the second part of your question, the year appears to always come at the end, so you can take the last five characters of the string and remove the closing parenthesis.

df['Year'] = df['title'].str[-5:].str.replace(')', '', regex=False)
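The question also asks for only a few columns. A minimal sketch of combining the row filter with a column list in a single `.loc` call, assuming the merged frame uses the MovieLens column names (`movieId`, `title`, `genres`, `userId`, `rating`) and that `userId` is stored as an integer; the tiny frame and the name `wanted` are made up for illustration:

```python
import pandas as pd

# Hypothetical miniature of the merged frame, for illustration only
df = pd.DataFrame({
    "movieId": [1, 2, 68954],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Up (2009)"],
    "genres": ["Adventure|Animation", "Adventure|Children", "Adventure|Animation"],
    "userId": [2, 191, 191],
    "rating": [3.5, 4.0, 4.0],
})

# Columns to keep for the selected user
wanted = ["movieId", "title", "genres", "rating"]

# Boolean mask selects the rows, the list selects the columns
user191 = df.loc[df["userId"] == 191, wanted]

print(user191)
```

If the ids in your data are strings, compare against `'191'` instead.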
rhug123

Create a dictionary of dataframes for each unique user

  • Use a regular expression, r'\((\d+)\)', to extract the digits, \d, from between the parentheses and assign the value to a new 'Year' column
  • Instead of repeatedly filtering the dataframe for different users, this will add each user to a dictionary as the key, and the value will be the dataframe filtered for that user.
    • There are 162541 unique userId values, so instead of using df.userId.unique(), use a list of the specific userId values you're interested in.
  • Also see the answer to How to rotate seaborn barplot x-axis tick labels, which is related to this MovieLens dataset.
# question 2: create a column for the year extracted from the title
# extracts the digits between parentheses
# does not change the title column
df['Year'] = df.title.str.extract(r'\((\d+)\)')

# create dict of dataframes for each user
userid_movies = dict()
for user in [10, 15, 191]:  # df.userId.unique() = 162541 unique users
    data = df[df.userId == user]
    userid_movies[user] = data

# get data for user 191; assumes ids are int. if not, use '191'
userid_movies[191]  # if you're using jupyter, don't use print
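
As an alternative to filtering once per user inside the loop, the same dictionary could be built with a single `isin` filter followed by `groupby`. This is only a sketch on a made-up frame; `users_of_interest` and `subset` are names introduced here, and unlike the loop above, a user with no rows gets no key at all rather than an empty dataframe.

```python
import pandas as pd

# Hypothetical miniature of the merged frame, for illustration only
df = pd.DataFrame({
    "userId": [10, 15, 191, 191, 7],
    "movieId": [1, 2, 68135, 68954, 3],
    "rating": [4.0, 3.5, 3.0, 4.0, 5.0],
})

users_of_interest = [10, 15, 191]

# Keep only the rows for the users of interest, then split by user
subset = df[df["userId"].isin(users_of_interest)]
userid_movies = {
    int(user): frame.reset_index(drop=True)
    for user, frame in subset.groupby("userId")
}

print(userid_movies[191])
```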

Example

import pandas as pd

# load movies
movies = pd.read_csv('data/ml-25m/movies.csv')

# extract year
movies['Year'] = movies.title.str.extract(r'\((\d+)\)')

# display head
   movieId                               title                                       genres  Year
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy  1995
2        3             Grumpier Old Men (1995)                               Comedy|Romance  1995
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance  1995
4        5  Father of the Bride Part II (1995)                                       Comedy  1995

# load ratings
ratings = pd.read_csv('data/ml-25m/ratings.csv')

# merge on movieId
df = pd.merge(movies, ratings, on='movieId').reset_index(drop=True)

# display df
   movieId             title                                       genres  Year  userId  rating   timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       2     3.5  1141415820
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       3     4.0  1439472215
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       4     3.0  1573944252
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       5     4.0   858625949
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy  1995       8     4.0   890492517

# dict of dataframes
# there are 162541 unique userId values, so instead of using df.userId.unique()
# use a list of the specific Id values you're interested in
userid_movies = dict()
for user in [10, 15, 191]:
    data = df[df.userId == user].reset_index(drop=True)
    userid_movies[user] = data

# display(userid_movies[191].head())
   movieId                                  title                                              genres  Year  userId  rating   timestamp
0    68135                        17 Again (2009)                                        Comedy|Drama  2009     191     3.0  1473704208
1    68791            Terminator Salvation (2009)                    Action|Adventure|Sci-Fi|Thriller  2009     191     5.0  1473704167
2    68954                              Up (2009)                  Adventure|Animation|Children|Drama  2009     191     4.0  1473703994
3    69406                   Proposal, The (2009)                                      Comedy|Romance  2009     191     4.0  1473704198
4    69644  Ice Age: Dawn of the Dinosaurs (2009)  Action|Adventure|Animation|Children|Comedy|Romance  2009     191     1.5  1473704242

Trenton McKinney