
Hi, I need to find the movies that have more than one genre in the MovieLens project. The genre is not stored in a single column but is spread across multiple columns such as genre1, genre2, etc. I tried using item.sum(axis=1), but it didn't give me the required result.

I also tried the following code, based on a solution from another thread, but it didn't work.

tempdf = item[[column for column in item if 'genre' in column]]
number_of_genres = tempdf.sum(axis=1)
sub = item[number_of_genres > 1]
print(sub)

Can someone please help?

Kriti Pawar
Ananya
  • Please read [how to make a good reproducible pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – It_is_Chris Mar 17 '21 at 20:36

1 Answer

This assumes you are using the MovieLens 100k data set (obtained from https://grouplens.org/datasets/movielens/).

It comes with a file called 'u.item', which contains movie information including one-hot encoded genres.

Load the data:

import os
import pandas as pd

dt_dir_name = '/path/to/ml-100k/'

# Names for the one-hot genre columns, in the order they appear in u.item
genres = ['unknown', 'Action', 'Adventure', 'Animation',
          'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']
# u.item is pipe-delimited, has no header row, and is not UTF-8 encoded
movie_data = pd.read_csv(os.path.join(dt_dir_name, 'u.item'), delimiter='|',
                         encoding='latin-1',
                         names=['movie id', 'movie title', 'release date',
                                'video release date', 'IMDb URL'] + genres)

print('movie data', movie_data.shape)

Then we search for the movies with more than one genre and save their titles in a list:

movies_with_several_genres = []
for _, movie in movie_data.iterrows():
    if movie[genres].sum() > 1:
        movies_with_several_genres.append(movie['movie title'])

print(movies_with_several_genres)

Or, more Pythonic:

print([movie['movie title'] for _, movie in movie_data.iterrows() if movie[genres].sum() > 1])
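
The same filter can probably be written without iterating over rows, by summing the genre columns row-wise and using the result as a boolean mask. This is only a minimal sketch on a made-up three-row frame: the titles "Movie A" to "Movie C", the three-genre subset, and the names genre_cols, movies and multi_genre_titles are all hypothetical, chosen so the example runs without the data files. With the real data, movie_data and genres from above would take their place.

```python
import pandas as pd

# Hypothetical toy data: three made-up movies with one-hot genre flags.
genre_cols = ["Action", "Comedy", "Drama"]
movies = pd.DataFrame({
    "movie title": ["Movie A", "Movie B", "Movie C"],
    "Action": [1, 0, 1],
    "Comedy": [1, 0, 0],
    "Drama": [0, 1, 1],
})

# Sum only the genre columns across each row to count genres per movie.
genre_counts = movies[genre_cols].sum(axis=1)

# Keep the rows where the count exceeds one, then pull out the titles.
multi_genre_titles = movies.loc[genre_counts > 1, "movie title"].tolist()
print(multi_genre_titles)
```

Selecting the genre columns explicitly before summing matters here: summing over the whole frame would also add in numeric non-genre columns such as the movie id, which may be why a plain item.sum(axis=1) gives unexpected results.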
Brian Tompsett - 汤莱恩