I need to preprocess a column for ML, but each value in my genre feature can contain more than one genre, listed in alphabetical order within the string (so my idea of using .startswith to grab the first genre doesn't work). I'm new to Python, and the function below is the only approach I've come up with, but it produces too many 'Other' values, since most of the movies in the database have more than one genre. Could you kindly suggest a better solution?
cols_to_check = ['Action', 'Drama', 'Comedy', 'Romance', 'History', 'War']

def update_genre(row):
    x = row['genre']
    if x == 'Action':
        row["Genre"] = 'Action'
    elif x == 'Comedy':
        row["Genre"] = 'Comedy'
    elif x == 'Drama':
        row["Genre"] = 'Drama'
    elif x == 'Romance':
        row["Genre"] = 'Romance'
    elif x == 'War':
        row["Genre"] = 'War'
    else:
        row["Genre"] = 'Other'
    return row

df[["Genre"]] = 0
df = df.apply(update_genre, axis=1)
That is what I've tried so far. What I want is to extract a genre from each value, whether it appears as a standalone genre or as a substring of a longer list. I suppose my skills aren't sufficient yet.
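A minimal sketch of one way to get that behaviour, assuming the genres in each string are separated by commas and that the order of cols_to_check can serve as a priority order when several of them appear together (both are assumptions, and pick_genre is a hypothetical helper name):

```python
# Priority order is an assumption: the first entry found in the string wins.
cols_to_check = ['Action', 'Drama', 'Comedy', 'Romance', 'History', 'War']

def pick_genre(genre_string, priorities=cols_to_check):
    """Return the first priority genre present in a comma-separated string."""
    # Split e.g. "Comedy, Drama" into {"Comedy", "Drama"}, trimming spaces.
    parts = {g.strip() for g in str(genre_string).split(',')}
    for genre in priorities:
        if genre in parts:
            return genre
    # None of the genres of interest appear in this string.
    return 'Other'

# A few genre strings taken from this post.
samples = ['Drama', 'Comedy, Drama', 'War, Action, Adventure',
           'Horror, Comedy, Music', 'Romance, Thriller, Western']
labels = [pick_genre(s) for s in samples]
```

On a DataFrame, the same helper could be applied with df["Genre"] = df["genre"].map(pick_genre), which avoids the row-wise apply. Matching on whole genre names rather than raw substrings keeps a genre from being matched inside a longer name.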
The data looks like this (genre string followed by its count):
Drama 8498
Comedy 5420
Comedy, Drama 2654
Drama, Romance 2529
Comedy, Romance 1777
...
War, Action, Adventure 1
Romance, Thriller, Western 1
Action, Thriller, Western 1
Horror, Comedy, Music 1
Comedy, Sci-Fi, Sport 1
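If a single label per movie loses too much information, another option is one 0/1 indicator column per genre. This is a hedged sketch rather than a definitive solution: it assumes pandas is available and that every string uses ", " as the delimiter, as in the sample above. The names dummies and features are illustrative.

```python
import pandas as pd

# A few genre strings copied from the sample above.
df = pd.DataFrame({'genre': ['Drama', 'Comedy, Drama', 'Drama, Romance',
                             'War, Action, Adventure']})

# One 0/1 column per genre; sep must match the ", " delimiter in the data.
dummies = df['genre'].str.get_dummies(sep=', ')

# Keep only the genres of interest (cols_to_check from the question).
# reindex adds any genre missing from this toy sample as a column of zeros.
cols_to_check = ['Action', 'Drama', 'Comedy', 'Romance', 'History', 'War']
features = dummies.reindex(columns=cols_to_check, fill_value=0)
```

With this encoding a movie tagged "Comedy, Drama" counts toward both genres instead of falling into 'Other', at the cost of having several feature columns instead of one.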