10

I'm doing binary classification with a CNN on an imbalanced dataset: the ratio of positive to negative medical images is 0.4 : 0.6. I want to use SMOTE to oversample the positive class before training. However, the data is 4-dimensional, with shape (761, 64, 64, 3), which causes the error

Found array with dim 4. Estimator expected <= 2

So, I reshape my train_data:

X_res, y_res = smote.fit_resample(X_train.reshape(X_train.shape[0], -1), y_train.ravel())

And it works fine. Before feeding it to the CNN, I reshape it back with:

X_res = X_res.reshape(X_res.shape[0], 64, 64, 3)

Now I'm not sure: is this a correct way to oversample, and does the reshape operation change the images' structure?
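On the second worry: a flatten-then-reshape round trip is lossless, because NumPy's reshape only reinterprets the array's layout and never reorders the underlying values. A minimal check on random stand-in data (the array here is hypothetical, but the shapes mirror the question):

```python
import numpy as np

# Hypothetical stand-in for the training set: 10 RGB images of 64x64.
rng = np.random.default_rng(0)
X_train = rng.random((10, 64, 64, 3))

# Flatten for SMOTE: each image becomes one row of 64*64*3 = 12288 values.
flat = X_train.reshape(X_train.shape[0], -1)

# Reshape back to image form after oversampling.
restored = flat.reshape(-1, 64, 64, 3)

# The round trip is exact: reshape changes the view, not the values.
print(np.array_equal(X_train, restored))  # True
```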

Salmon

3 Answers

7

I had a similar issue: I used the reshape function to flatten each image before applying SMOTE.

from imblearn.over_sampling import SMOTE

X_train.shape          # (8000, 250, 250, 3)

ReX_train = X_train.reshape(8000, 250 * 250 * 3)
ReX_train.shape        # (8000, 187500)

smt = SMOTE()
Xs_train, ys_train = smt.fit_resample(ReX_train, y_train)

This approach is painfully slow, but it did help improve performance.

Aditya Bhattacharya
  • Also, as an extension to the previous answer: in case you are using a pretrained network with the feature-extraction approach, you won't need to reshape the flattened image back. Otherwise, the image has to be reshaped back to its original dimensions. There is also an article about SMOTE that may help: https://medium.com/@adib0073/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad – Aditya Bhattacharya Feb 02 '20 at 11:09
3
  1. As soon as you flatten an image you are losing localized (spatial) information; this is one of the reasons convolutions are used in image-based machine learning.
  2. A shape of 8000x250x250x3 has inherent meaning: 8000 samples of images, each 250 wide and 250 high, all with 3 channels. After reshaping to 8000x(250*250*3) it is just a bunch of numbers, and unless you use some kind of sequence network to learn from it, that is bad.
  3. Oversampling is bad for image data; do image augmentations instead (crops, noise such as Gaussian blur, rotations, translations, etc.).
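The augmentations suggested in point 3 can be sketched in plain NumPy. The `augment` helper and its parameters (flip probability, rotation choices, noise scale) are invented for illustration; libraries like albumentations implement this properly:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentations instead of duplicating minority-class images:
    random horizontal flip, random 90-degree rotation, light Gaussian noise."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # horizontal flip
    img = np.rot90(img, k=int(rng.integers(4)))      # 0/90/180/270 degrees
    img = img + rng.normal(0.0, 0.01, img.shape)     # Gaussian noise
    return np.clip(img, 0.0, 1.0)

image = rng.random((64, 64, 3))
print(augment(image).shape)  # (64, 64, 3)
```

Applying a random augmentation each time a minority-class image is sampled yields a different training example on every epoch, unlike plain duplication.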
cerofrais
  • Do you have any sources to support your third point, that "oversampling is bad for image data"? – CuCaRot Jun 18 '20 at 04:14
  • 1
    When I said "oversampling is bad" I meant that if we just duplicate the minority-class images, we are not trying to generalize; the model sees the same image multiple times, which in some cases leads to overfitting. Sorry about the uncertain statement. – cerofrais Jun 18 '20 at 08:53
  • Just to debate the point: besides augmentation, do you know any other methods for dealing with an imbalanced dataset? I am facing a dataset whose major/minor split is 98/2, meaning only 2% of the samples belong to one class. – CuCaRot Jun 18 '20 at 09:00
  • 1
    Assuming image data, I would suggest k-fold cross-validation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) plus image augmentations (https://github.com/albumentations-team/albumentations). Still, unless you are doing anomaly detection, a 98-2 split is really bad. – cerofrais Jun 18 '20 at 09:03
  • I think oversampling and then augmenting only the minority class is a good idea (https://medium.com/swlh/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad) – CuCaRot Jun 18 '20 at 11:50
2
  • First, flatten the images
  • Apply SMOTE to the flattened image data and its labels
  • Reshape the flattened images back to RGB images
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)

# Flatten each image: (80, 100, 100, 3) -> (80, 30000)
train_rows = len(X_train)
X_train = X_train.reshape(train_rows, -1)

# Oversample the minority class
X_train, y_train = sm.fit_resample(X_train, y_train)

# Reshape back to RGB images: (>80, 30000) -> (>80, 100, 100, 3)
X_train = X_train.reshape(-1, 100, 100, 3)
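What SMOTE computes on those flattened 30,000-dimensional rows is simple interpolation: each synthetic sample lies on the segment between a minority sample and one of its nearest minority neighbors. A hand-rolled sketch with hypothetical vectors standing in for two flattened images:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(30000)          # one flattened minority image (100*100*3)
neighbor = rng.random(30000)   # one of its nearest minority neighbors (hypothetical)
lam = rng.random()             # interpolation factor in [0, 1)

# SMOTE's core formula: x_new = x + lam * (neighbor - x)
synthetic = x + lam * (neighbor - x)

# Every value of the synthetic sample stays between its two parents,
# so it reshapes back to a valid (100, 100, 3) image like any real row.
print(synthetic.reshape(100, 100, 3).shape)  # (100, 100, 3)
```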

Hemanth Kollipara