1

This is a common question, but I can't find an answer that fits this particular scenario.

I have a DataFrame with a genre column (e.g. "Drama, Western") and one-hot encoded versions of the genres. For a "Drama, Western" row there is a 1 in both the Drama and Western columns, but where the genre is just Western, the Western column is 1 and the Drama column is 0.

I want a filtered DataFrame containing only the rows whose genre is Western and nothing else. I'm trying to oversample for a model because Western is a minority class, but I don't want to increase the counts of other genres as a byproduct.

There are multiple such rows, so I can't select them by index, and there are multiple genres, so I can't use a condition like df[(df['Western']==1) & (df['Drama']==0)] without having to account for all 24 genres.

Index | Genre           | Drama | Western | Action | genre 4 |
    0 | Drama, Western  |     1 |       1 |      0 |       0 |
    1 | Western         |     0 |       1 |      0 |       0 |
    3 | Action, Western |     0 |       1 |      1 |       0 |
Henry Ecker
Digital Moniker

3 Answers

2

If I understand your question correctly, you want those rows where only 'Western' is 1, i.e. the genre is only Western, nothing else.

Why do you have to use the encoded columns then? Just use the original 'Genre' column where the data is in string format. No need to overcomplicate things.

new_df = df[df['Genre']=='Western']
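
As a quick sanity check, here is a minimal sketch on a toy frame rebuilt from the table in the question (the frame itself is an assumption for illustration). The `==` comparison tests the whole string, so "Drama, Western" does not match. The `.str.strip()` call is an extra precaution, assuming the real data might carry stray whitespace around the genre text.

```python
import pandas as pd

# Toy data reconstructed from the question's table (illustrative only).
df = pd.DataFrame(
    {
        "Genre": ["Drama, Western", "Western", "Action, Western"],
        "Drama": [1, 0, 0],
        "Western": [1, 1, 1],
        "Action": [0, 0, 1],
        "genre 4": [0, 0, 0],
    },
    index=[0, 1, 3],
)

# Exact string equality: only rows whose Genre is exactly "Western" survive.
new_df = df[df["Genre"].str.strip() == "Western"]

print(new_df.index.tolist())  # -> [1]
```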
S R Maiti
  • I can't believe I didn't just try this. I thought for sure it would also return Drama, Western, etc. I'm embarrassed about this one... thanks though – Digital Moniker Mar 26 '21 at 16:35
  • @Goldenhigh No worries! We all have had moments like that, when we can't see what's right in front of us :-) – S R Maiti Mar 26 '21 at 18:15
1

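Make a list of the genre columns, e.g. column_list = ['Western', 'Drama', 'Action', ...], and sum them across each row. If the sum is 1 and the 'Western' column is 1, then Western is the only genre for that row. Combine the two conditions with & (each wrapped in parentheses) rather than and, since and raises a ValueError on a pandas Series. Try this out; it should return the 'Index' value of each row where only 'Western' is 1 (this assumes 'Index' is a regular column; if it is the DataFrame's index, drop the column selector and use .index on the result):

column_list = ['Western', 'Drama', 'Action', ...]
df.loc[(df[column_list].sum(axis=1) == 1) & (df['Western'] == 1), 'Index']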
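
A runnable sketch of the sum-based filter, using a toy frame rebuilt from the question's table (the data and the four-column `column_list` are assumptions for illustration; the real list would hold all 24 genre columns). Boolean Series are combined with `&` and parenthesized here, because Python's `and` does not work element-wise.

```python
import pandas as pd

# Toy data reconstructed from the question's table (illustrative only).
df = pd.DataFrame(
    {
        "Genre": ["Drama, Western", "Western", "Action, Western"],
        "Drama": [1, 0, 0],
        "Western": [1, 1, 1],
        "Action": [0, 0, 1],
        "genre 4": [0, 0, 0],
    },
    index=[0, 1, 3],
)

# All one-hot genre columns (only four in this toy example).
column_list = ["Western", "Drama", "Action", "genre 4"]

# Exactly one genre flag is set in the row, and that flag is Western.
mask = (df[column_list].sum(axis=1) == 1) & (df["Western"] == 1)
only_western = df.loc[mask]

print(only_western.index.tolist())  # -> [1]
```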
Cute Panda
  • I will also try this, as it's along the lines of what I thought I would have to do and I can learn from it, but you'll see my embarrassment below at how easy the solution was... – Digital Moniker Mar 26 '21 at 16:37
1

If you haven't got the Genre column, you could do

df[
    (df['Western']==1)
    &
    (df[df.columns.difference(['Western'])]==0).all(axis=1)
]
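
If the frame does still contain the string Genre column, comparing it to 0 would be False for every row, so the filter above would return nothing. A hedged variant is to exclude that column as well; the toy frame below is rebuilt from the question's table and is only an illustration.

```python
import pandas as pd

# Toy data reconstructed from the question's table (illustrative only).
df = pd.DataFrame(
    {
        "Genre": ["Drama, Western", "Western", "Action, Western"],
        "Drama": [1, 0, 0],
        "Western": [1, 1, 1],
        "Action": [0, 0, 1],
        "genre 4": [0, 0, 0],
    },
    index=[0, 1, 3],
)

# Every column except Western and the non-numeric Genre column.
other_genres = df.columns.difference(["Western", "Genre"])

# Western is set and every other genre flag is 0.
result = df[(df["Western"] == 1) & (df[other_genres] == 0).all(axis=1)]

print(result.index.tolist())  # -> [1]
```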
Max Pierini