I'm analyzing movies. Each movie has a genre attribute that may list several genres, such as drama or comedy. The data looks like this:

movie_list = [
    {'name': 'Movie 1',
    'genre' :'Action, Fantasy, Horror'},
    {'name': 'Movie 2',
    'genre' :'Action, Comedy, Family'},
    {'name': 'Movie 3',
    'genre' :'Biography, Drama'},
    {'name': 'Movie 4',
    'genre' :'Biography, Drama, Romance'},
    {'name': 'Movie 5',
    'genre' :'Drama'},
    {'name': 'Movie 6',
    'genre' :'Documentary'},
]

The question is how to analyze this. For example, how do I find out how many action movies there are, and how do I query for the Action category? Specifically:

  1. How do I get all the categories in this list, so that I know how many movies each one contains?

  2. How do I query for a certain kind of movies, like action?

  3. Do I need to turn the genre into an array?

Currently I can get by on the second question with df[df['genre'].str.contains("Action")].describe(), but is there a better syntax?

ZK Zhao

1 Answer

If your data isn't too large, I would do some pre-processing to get one record per movie and genre. That is, I would structure your data frame like this:

 Name    Genre
 Movie 1 Action
 Movie 1 Fantasy
 Movie 1 Horror
 ...

Note that the names are repeated, once per genre. This may make your data set much bigger, but if your system can handle it, it makes the analysis very easy. Use the following code to do the transformation:

import pandas as pd

def reformat_movie_list(movies):
    """Return a DataFrame with one row per (movie, genre) pair."""
    names = []
    genres = []
    for movie in movies:
        movie_name = movie["name"].strip()
        # Genres arrive as one comma-separated string, e.g. 'Action, Fantasy'.
        for movie_genre in movie["genre"].split(","):
            names.append(movie_name)
            genres.append(movie_genre.strip())
    # Build the frame once, after all rows have been collected.
    return pd.DataFrame({"name": names, "genre": genres})
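
As an alternative to the explicit loop, the same one-row-per-genre layout can probably be produced with pandas' own string and reshaping methods. This is only a sketch, not the method above: it assumes a pandas version that provides DataFrame.explode (0.25 or later), and the variable names df and movie_df are chosen for illustration.

```python
import pandas as pd

movie_list = [
    {'name': 'Movie 1', 'genre': 'Action, Fantasy, Horror'},
    {'name': 'Movie 2', 'genre': 'Action, Comedy, Family'},
    {'name': 'Movie 3', 'genre': 'Biography, Drama'},
    {'name': 'Movie 4', 'genre': 'Biography, Drama, Romance'},
    {'name': 'Movie 5', 'genre': 'Drama'},
    {'name': 'Movie 6', 'genre': 'Documentary'},
]

df = pd.DataFrame(movie_list)

# Split the comma-separated string into a list, then emit one row per list item.
movie_df = (
    df.assign(genre=df["genre"].str.split(","))
      .explode("genre")
      .reset_index(drop=True)
)

# Remove the space left over after each comma.
movie_df["genre"] = movie_df["genre"].str.strip()

print(movie_df)
```

The result should have the same two columns, name and genre, as the frame returned by reformat_movie_list, so the queries that follow apply to either.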

In this format, with movie_df = reformat_movie_list(movie_list), your 3 questions become

  1. How do I get all the categories in this list, so that I know how many movies each one contains?

    movie_df.groupby("genre").agg("count")

see How to count number of rows in a group in pandas group by object?

  2. How do I query for a certain kind of movies, like action?

    action_movies = movie_df[movie_df["genre"] == "Action"]

The comparison is case-sensitive, so the value must match the capitalization used in the data. See pandas: filter rows of DataFrame with operator chaining

  3. Do I need to turn the genre into array?

No separate array column is needed; reshaping the data to one row per genre takes care of it.
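
To see the three answers working together on the sample data, here is a minimal sketch. It assumes the long frame is called movie_df as above; the list comprehension is only a compact stand-in for the reshaping step, and the names genre_counts, all_genres and action_names are illustrative.

```python
import pandas as pd

movie_list = [
    {'name': 'Movie 1', 'genre': 'Action, Fantasy, Horror'},
    {'name': 'Movie 2', 'genre': 'Action, Comedy, Family'},
    {'name': 'Movie 3', 'genre': 'Biography, Drama'},
    {'name': 'Movie 4', 'genre': 'Biography, Drama, Romance'},
    {'name': 'Movie 5', 'genre': 'Drama'},
    {'name': 'Movie 6', 'genre': 'Documentary'},
]

# One row per (movie, genre) pair, built in plain Python for brevity.
rows = [
    {"name": movie["name"], "genre": genre.strip()}
    for movie in movie_list
    for genre in movie["genre"].split(",")
]
movie_df = pd.DataFrame(rows)

# 1. All categories, and how many movies each one contains.
genre_counts = movie_df.groupby("genre").size()
all_genres = sorted(movie_df["genre"].unique())

# 2. Names of the movies in one category (the match is case-sensitive).
action_names = movie_df.loc[movie_df["genre"] == "Action", "name"].tolist()

print(genre_counts)
print(action_names)
```

groupby("genre").size() returns a single Series of counts indexed by genre, which is usually easier to read here than the per-column output of agg("count").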

sakurashinken