I'm trying to apply random sampling to an imbalanced dataset so that a model can predict the appropriate 'category' for a given 'description'.

df_1['Category'].value_counts().loc[lambda x : x>1]

There are too many categories, and they are unevenly distributed. I want to bring them all to the same count so that the machine learning model does not always predict, say, 'iam~ki-000' just because it has the most samples.

 iam~ki-000                378
 iam~ki-002                180
 iam~ki-049                 99
 iam~ki-050                 91
 iam~ki-057                 91
                          ... 
 iam~ki-077                  2 

So far I have come up with only one solution, and it is very inefficient :(

It is to oversample each category individually by duplicating its rows a chosen number of times. There are almost 90 categories in total. Can someone help me write a function that brings all categories to the same size?

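import pandas as pd
# Boolean mask selecting the rows of one category (here 'iam~ki-057')
ki_057 = df['Category'] == 'iam~ki-057'
df_try = df[ki_057]
# Add 4 extra copies of those rows; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df] + [df_try] * 4, ignore_index=True)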
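One way to avoid repeating this per category is to loop over the groups and top each one up to the size of the largest. The following is a minimal sketch, not a definitive implementation: the function name `oversample_to_max`, its parameters, and the toy data are hypothetical, and the column names 'Category' and 'Description' are assumed from the question.

```python
import pandas as pd


def oversample_to_max(df, label_col="Category", random_state=0):
    """Keep every original row and add randomly duplicated rows so that
    each category ends up with as many rows as the largest category."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        parts.append(group)  # original rows are always kept
        n_extra = target - len(group)
        if n_extra > 0:
            # Draw the missing rows from the same category, with replacement
            parts.append(
                group.sample(n=n_extra, replace=True, random_state=random_state)
            )
    balanced = pd.concat(parts, ignore_index=True)
    # Shuffle so that duplicated rows are not grouped together
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data (made up for illustration): 3, 2 and 1 rows per category
toy = pd.DataFrame({
    "Description": ["a", "b", "c", "d", "e", "f"],
    "Category": ["iam~ki-000"] * 3 + ["iam~ki-002"] * 2 + ["iam~ki-077"],
})
balanced = oversample_to_max(toy)
print(balanced["Category"].value_counts())
```

If the data is later split into training and test sets, it is usually safer to oversample only the training part, after the split, so that duplicated rows do not end up on both sides.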
Adolf
  • This might help: https://stackoverflow.com/questions/48373088/duplicating-training-examples-to-handle-class-imbalance-in-a-pandas-data-frame – isydmr Jan 13 '20 at 09:50
  • @isydmr Thank you, this might work. I will update the code after testing. – Adolf Jan 13 '20 at 13:28
  • Check out the `RandomOverSampler` class from `imblearn`. This is where I got this info from: https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/ – agent18 Dec 31 '20 at 11:56
