
My dataset is quite imbalanced: the two minority classes each contain half as many samples as the majority class. My RNN model is unable to learn anything about the least populated class.

I'm trying to use the imbalanced-learn library. For instance:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42, n_jobs=-1, k_neighbors=10)
X_train, y_train = sm.fit_resample(train.drop(['label'], axis=1), train['label'])

works if train.drop(['label'], axis=1) contains only the values of the features used. The problem is that my DataFrame contains one additional column with strings as values. I cannot drop it, since those strings are the input for my RNN, and if I did drop it, I would not be able to tell which row of the oversampled dataset each string belongs to.

Is there a way to keep all the columns and tell the function which columns to use for oversampling?

wrong_path

2 Answers


If the string column is an input to your RNN and you plan to encode it somehow (one-hot encoding, for example), just encode that column before oversampling and then run the oversampling with the new encoded columns instead of the string column.

  • Thanks. The problem is that if I replace the strings with label encoded NumPy arrays (so that each character corresponds to an integer), imbalanced-learn complains: *setting an array element with a sequence*. – wrong_path Sep 02 '19 at 13:36
  • I could in principle split that column into multiple columns, each with a single number... Is there a better and faster solution? – wrong_path Sep 02 '19 at 13:42

For those who need to do something similar: a co-author of the library suggested that I use SMOTENC, which can also handle categorical variables (such as strings).
