My dataset is quite imbalanced: each of the two minority classes contains half as many samples as the majority class. My RNN model is not able to learn anything about the least populated class.
I'm trying to use the imbalanced-learn library. For instance:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42, n_jobs=-1, k_neighbors=10)
X_train, y_train = sm.fit_resample(train.drop(['label'], axis=1), train['label'])
This works if train.drop(['label'], axis=1) contains only the values of the features used. The problem is that my DataFrame has one additional column containing strings. I cannot drop it, since those strings are the input to my RNN, and if I did drop it, I would not be able to tell which row of the oversampled dataset each string belongs to.
Is there a way to keep all the columns and tell the function which columns to use for oversampling?