I need to preprocess a column for ML, but each value in my feature can contain more than one genre, listed in alphabetical order within each string (so my idea of using .startswith to pick out the first genre isn't working). I'm new to Python, and this function is the only approach I could figure out, but it produces too many 'Other' values because most of the movies in the database have more than one genre. Could you kindly suggest a better solution?

cols_to_check = ['Action', 'Drama', 'Comedy', 'Romance', 'History', 'War']

def update_genre(row):
    x = row['genre']
    # Exact comparison: only cells holding a single genre match,
    # so 'Comedy, Drama' falls through to 'Other'.
    if x == 'Action':
        row["Genre"] = 'Action'
    elif x == 'Comedy':
        row["Genre"] = 'Comedy'
    elif x == 'Drama':
        row["Genre"] = 'Drama'
    elif x == 'Romance':
        row["Genre"] = 'Romance'
    elif x == 'War':
        row["Genre"] = 'War'
    else:
        row["Genre"] = 'Other'

    return row

df["Genre"] = 0
df = df.apply(update_genre, axis=1)

The code above is what I've tried. I'd like to extract a single genre from each cell, whether the cell holds a standalone genre or several genres in one string. My skills aren't sufficient, I suppose.

The data looks like this (output of value_counts() on the genre column):

Drama                         8498
Comedy                        5420
Comedy, Drama                 2654
Drama, Romance                2529
Comedy, Romance               1777
                              ... 
War, Action, Adventure           1
Romance, Thriller, Western       1
Action, Thriller, Western        1
Horror, Comedy, Music            1
Comedy, Sci-Fi, Sport            1
James Z
Maria Li
  • What does your data look like? (please in a reproducible **text** format) – mozway Oct 29 '22 at 05:00
  • @mozway this is what value_counts() prints, does it help? Drama 8498; Comedy 5420; Comedy, Drama 2654; Drama, Romance 2529; Comedy, Romance 1777; ... War, Action, Adventure 1; Romance, Thriller, Western 1; Action, Thriller, Western 1; Horror, Comedy, Music 1; Comedy, Sci-Fi, Sport 1. So one cell may hold just 'Comedy' or 'Comedy, Romance', for example – Maria Li Oct 29 '22 at 05:05
  • please [edit](https://stackoverflow.com/posts/74242917/edit) the question with the details – mozway Oct 29 '22 at 05:07
  • I added the details to the question as best I could. I'm particularly struggling with multiple genres in one cell, such as 'War, Action, Adventure' for one film. And I can't attach a picture to show it better :( – Maria Li Oct 29 '22 at 05:16
  • So you have a Series with your genres as index? What about the expected output? – mozway Oct 29 '22 at 05:51
  • That's a DataFrame where genre is a column. I need to extract the first genre in a cell, for example. Using iloc I was only able to extract the first letter, not the first word @mozway – Maria Li Oct 29 '22 at 05:58
  • Can you please provide a minimal version of your DataFrame (max 10 rows) and the matching expected output? – mozway Oct 29 '22 at 06:01
  • I'd love to, but I don't know how I can do that. @mozway – Maria Li Oct 29 '22 at 06:04
  • see [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – mozway Oct 29 '22 at 06:18
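
For what the comments above are asking, a minimal sketch of one way to share a small text sample of a DataFrame might look like this. The sample rows are hypothetical stand-ins based on the value_counts() output in the question, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only the 'genre' column matters here.
df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Comedy, Drama",
              "Drama, Romance", "War, Action, Adventure"]
})

# Dump the first rows (at most 10) as a plain dict that can be pasted as text.
sample = df.head(10).to_dict("list")
print(sample)

# Anyone reading the question can rebuild the same DataFrame from that text.
rebuilt = pd.DataFrame(sample)
```

Pasting the printed dict into the question, together with the expected output for those rows, is usually enough for others to reproduce the problem.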

1 Answer

df = df.reindex(sorted(df.columns), axis=1)

This should give the result you want. Alternatively:

df.sort_index(axis=1, ascending=False)

If the ascending parameter is True, your columns will be sorted alphabetically; with ascending=False, as in the line above, they are sorted in reverse alphabetical order.

ali
  • I tried, it doesn't solve the problem - I still have multiple genres in one cell, such as 'War, Action, Adventure' for a film. To process further, I need to keep only one – Maria Li Oct 29 '22 at 05:17
  • I don't need to sort alphabetically - it doesn't help me in any way – Maria Li Oct 29 '22 at 05:28
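
Since the comments above make clear that the goal is to keep one genre per cell, here is a rough sketch of one possible approach, not a definitive solution. It assumes the column is named 'genre', that genres are separated by commas, and that "the first genre" means the text before the first comma; the sample rows are hypothetical and taken from the value_counts() output in the question:

```python
import pandas as pd

# Hypothetical sample data mirroring the value_counts() output in the question.
df = pd.DataFrame({
    "genre": ["Drama", "Comedy, Drama", "War, Action, Adventure",
              "Horror, Comedy, Music", "Romance, Thriller, Western"]
})

cols_to_check = ["Action", "Drama", "Comedy", "Romance", "History", "War"]

# Take the text before the first comma, i.e. the first listed genre.
first_genre = df["genre"].str.split(",").str[0].str.strip()

# Keep the genre if it is in the list of interest, otherwise label it 'Other'.
df["Genre"] = first_genre.where(first_genre.isin(cols_to_check), "Other")

# Possible alternative for ML: one 0/1 indicator column per genre,
# which keeps every genre instead of only the first one.
genre_dummies = df["genre"].str.get_dummies(sep=", ")

print(df)
```

The first variant replaces the if/elif chain with a vectorised string split, so multi-genre cells are no longer all mapped to 'Other'. The indicator columns may suit a model better if dropping the secondary genres loses too much information.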