How to preprocess categorical data in this case?

Question

I need to preprocess a column for ML, but each value in my feature can contain more than one genre, listed in alphabetical order within the string, so my idea of using .startswith to pick up the first genre isn't working. I'm new to Python, and this function is the only approach I could come up with, but it produces too many 'Other's since most of the movies in the database have more than one genre. Can you kindly suggest a better solution?

cols_to_check = ['Action', 'Drama', 'Comedy', 'Romance', 'History', 'War']

def update_genre(row):
    x = row['genre']
    if x == 'Action':
        row["Genre"] = 'Action'
    elif x == 'Comedy':
        row["Genre"] = 'Comedy'
    elif x == 'Drama':
        row["Genre"] = 'Drama'
    elif x == 'Romance':
        row["Genre"] = 'Romance'
    elif x == 'War':
        row["Genre"] = 'War'
    else:
        row["Genre"] = 'Other'

    return row

df[["Genre"]] = 0
df = df.apply(update_genre, axis=1)

Above is what I've tried. I'd like to somehow pull out a single genre, whether it appears as a standalone value or as a substring of a longer list. I suppose my skills aren't sufficient yet.

The data looks like this (output of value_counts()):

Drama                         8498
Comedy                        5420
Comedy, Drama                 2654
Drama, Romance                2529
Comedy, Romance               1777
                              ... 
War, Action, Adventure           1
Romance, Thriller, Western       1
Action, Thriller, Western        1
Horror, Comedy, Music            1
Comedy, Sci-Fi, Sport            1

What does your data look like? (Please post it in a reproducible **text** format.) – mozway, Oct 29 '22 at 05:00
@mozway This is what value_counts() prints, does it help? Drama 8498 Comedy 5420 Comedy, Drama 2654 Drama, Romance 2529 Comedy, Romance 1777 ... War, Action, Adventure 1 Romance, Thriller, Western 1 Action, Thriller, Western 1 Horror, Comedy, Music 1 Comedy, Sci-Fi, Sport 1. So one cell may hold just 'Comedy' or 'Comedy, Romance', for example. – Maria Li, Oct 29 '22 at 05:05
Please [edit](https://stackoverflow.com/posts/74242917/edit) the question with the details. – mozway, Oct 29 '22 at 05:07
I added the details to the question, as best as I could. I'm particularly struggling with multiple genres in one cell, such as 'War, Action, Adventure' for one film. And I can't attach a picture to show it better :( – Maria Li, Oct 29 '22 at 05:16
So you have a Series with your genres as index? What about the expected output? – mozway, Oct 29 '22 at 05:51
That's a dataframe where genre is a column. I need to extract the first genre in a cell, for example. Using iloc I was only able to extract the first letter, not the first word @mozway – Maria Li, Oct 29 '22 at 05:58
Can you please provide a minimal version of your DataFrame (max 10 rows) and the matching expected output? – mozway, Oct 29 '22 at 06:01
See [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – mozway, Oct 29 '22 at 06:18
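For the "first letter instead of first word" problem mentioned in the comments, a minimal sketch might look like the following. The three-row frame is a hypothetical sample modelled on the value_counts() output in the question, and the column name first_genre is an assumption chosen for illustration. The idea is to split each cell on the comma before indexing, so that position 0 refers to the first genre and not the first character.

```python
import pandas as pd

# Hypothetical sample modelled on the value_counts() output in the question.
df = pd.DataFrame({"genre": ["Drama", "Comedy, Drama", "War, Action, Adventure"]})

# Indexing the raw strings returns the first character of each cell.
first_letter = df["genre"].str[0]

# Splitting on the separator first turns each cell into a list of genres,
# so .str[0] then returns the first list element, i.e. the first genre.
# first_genre is an illustrative column name, not one from the question.
df["first_genre"] = df["genre"].str.split(",").str[0].str.strip()

print(df)
```

This assumes the genre column has no missing values and that genres are always separated by commas.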

ali · Answer 1 · 2022-10-29T05:14:06.957

0

df = df.reindex(sorted(df.columns), axis=1)

This should give the result you want. Alternatively:

df.sort_index(axis=1, ascending=False)

If the ascending parameter is True, your columns will be sorted alphabetically; with False, as in the call above, they are sorted in reverse alphabetical order.
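As a rough illustration of what these two calls do, here is a sketch on a hypothetical two-column frame (the column names and values are assumptions, not taken from the question's data). Both calls reorder the column labels; the contents of each cell are left as they were.

```python
import pandas as pd

# Hypothetical frame; column names and values are illustrative only.
df = pd.DataFrame({
    "title": ["A", "B"],
    "genre": ["War, Action, Adventure", "Comedy"],
})

# Reorder columns alphabetically by label.
by_reindex = df.reindex(sorted(df.columns), axis=1)

# Reorder columns by label in reverse alphabetical order.
by_sort_desc = df.sort_index(axis=1, ascending=False)

print(list(by_reindex.columns))      # ['genre', 'title']
print(list(by_sort_desc.columns))    # ['title', 'genre']
print(by_reindex["genre"].tolist())  # cell values are unchanged
```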

edited Oct 29 '22 at 05:14

answered Oct 29 '22 at 05:12


I tried, it doesn't solve the problem - I still have multiple genres in one cell, such as 'War, Action, Adventure' for a film. To process further, I need to keep only one – Maria Li Oct 29 '22 at 05:17
I don't need to sort alphabetically - it doesn't help me in any way – Maria Li Oct 29 '22 at 05:28
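To keep only one genre per cell, as described in the comments above, one possible sketch is to split each cell and return the first genre that appears in cols_to_check, falling back to 'Other'. The helper name pick_genre and the sample rows are hypothetical; the list cols_to_check and the column names genre and Genre are taken from the question. The sketch assumes the column contains no missing values.

```python
import pandas as pd

cols_to_check = ["Action", "Drama", "Comedy", "Romance", "History", "War"]

# Hypothetical sample rows modelled on the strings shown in the question.
df = pd.DataFrame({"genre": [
    "Drama",
    "Comedy, Romance",
    "War, Action, Adventure",
    "Horror, Comedy, Music",
    "Sci-Fi, Sport",
]})

def pick_genre(cell, wanted=cols_to_check):
    """Return the first genre listed in the cell that is in `wanted`, else 'Other'."""
    for genre in (part.strip() for part in cell.split(",")):
        if genre in wanted:
            return genre
    return "Other"

# map applies the helper to each cell of the column.
df["Genre"] = df["genre"].map(pick_genre)

print(df)
```

Here the order of genres inside the cell decides which one wins. If the order of cols_to_check should decide instead, the two loops can be swapped. If every genre should be kept, Series.str.get_dummies(sep=", ") builds one indicator column per genre, which may suit some models better.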
