10

I'm doing binary classification with a CNN on an imbalanced dataset: the ratio of positive to negative medical images is 0.4 : 0.6. I want to use SMOTE to oversample the positive class before training. However, the data is 4-dimensional, with shape (761, 64, 64, 3), which causes the error

Found array with dim 4. Estimator expected <= 2

So, I reshape my train_data:

X_res, y_res = smote.fit_resample(X_train.reshape(X_train.shape[0], -1), y_train.ravel())

And it works fine. Before feeding it to the CNN, I reshape it back with:

X_res = X_res.reshape(X_res.shape[0], 64, 64, 3)

Now I'm not sure: is this a correct way to oversample, and does the reshape operation change the images' structure?
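On the second worry: a flatten-then-reshape round trip is lossless, because NumPy's reshape only reinterprets the array's layout and never reorders the underlying values. A minimal check on random stand-in data (the array here is hypothetical, but the shapes mirror the question):

```python
import numpy as np

# Hypothetical stand-in for the training set: 10 RGB images of 64x64.
rng = np.random.default_rng(0)
X_train = rng.random((10, 64, 64, 3))

# Flatten for SMOTE: each image becomes one row of 64*64*3 = 12288 values.
flat = X_train.reshape(X_train.shape[0], -1)

# Reshape back to image form after oversampling.
restored = flat.reshape(-1, 64, 64, 3)

# The round trip is exact: reshape changes the view, not the values.
print(np.array_equal(X_train, restored))  # True
```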

Salmon

3 Answers

7

I had a similar issue: I used the reshape function to flatten each image before applying SMOTE.

from imblearn.over_sampling import SMOTE

X_train.shape          # (8000, 250, 250, 3)

ReX_train = X_train.reshape(8000, 250 * 250 * 3)
ReX_train.shape        # (8000, 187500)

smt = SMOTE()
Xs_train, ys_train = smt.fit_resample(ReX_train, y_train)

This approach is painfully slow, but it did help improve performance.

Aditya Bhattacharya
  • Also, as an extension to the previous answer: in case you are using a pretrained network with the feature-extraction approach, you won't need to reshape the flattened image back. Otherwise, the image has to be reshaped back to its original dimensions. There is also an article about SMOTE that may help: https://medium.com/@adib0073/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad – Aditya Bhattacharya Feb 02 '20 at 11:09
3
  1. As soon as you flatten an image you are losing localized (spatial) information; this is one of the reasons convolutions are used in image-based machine learning.
  2. A shape of 8000x250x250x3 has inherent meaning: 8000 samples of images, each 250 wide and 250 high, all with 3 channels. After reshaping to 8000x(250*250*3) it is just a bunch of numbers, and unless you use some kind of sequence network to learn from it, that is bad.
  3. Oversampling is bad for image data; do image augmentations instead (crops, noise such as Gaussian blur, rotations, translations, etc.).
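The augmentations suggested in point 3 can be sketched in plain NumPy. The `augment` helper and its parameters (flip probability, rotation choices, noise scale) are invented for illustration; libraries like albumentations implement this properly:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Toy augmentations instead of duplicating minority-class images:
    random horizontal flip, random 90-degree rotation, light Gaussian noise."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # horizontal flip
    img = np.rot90(img, k=int(rng.integers(4)))      # 0/90/180/270 degrees
    img = img + rng.normal(0.0, 0.01, img.shape)     # Gaussian noise
    return np.clip(img, 0.0, 1.0)

image = rng.random((64, 64, 3))
print(augment(image).shape)  # (64, 64, 3)
```

Applying a random augmentation each time a minority-class image is sampled yields a different training example on every epoch, unlike plain duplication.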
cerofrais
  • Do you have any sources to support your third point, that "oversampling is bad for image data"? – CuCaRot Jun 18 '20 at 04:14
  • 1
    When I said "oversampling is bad" I meant that if we just duplicate the minority-class images, we are not trying to generalize; the model sees the same image multiple times, which in some cases leads to overfitting. Sorry about the uncertain statement. – cerofrais Jun 18 '20 at 08:53
  • Just to debate the point: besides augmentation, do you know any other methods for dealing with an imbalanced dataset? I am facing a dataset whose major/minor split is 98/2, meaning only 2% of the samples belong to one class. – CuCaRot Jun 18 '20 at 09:00
  • 1
    Assuming image data, I would suggest k-fold cross-validation (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) plus image augmentations (https://github.com/albumentations-team/albumentations). Still, unless you are doing anomaly detection, a 98-2 split is really bad. – cerofrais Jun 18 '20 at 09:03
  • I think oversampling and then augmenting only the minority class is a good idea (https://medium.com/swlh/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad) – CuCaRot Jun 18 '20 at 11:50
2
  • First, flatten the images
  • Apply SMOTE to the flattened image data and its labels
  • Reshape the flattened images back to RGB images
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)

# Flatten each image: (80, 100, 100, 3) -> (80, 30000)
train_rows = len(X_train)
X_train = X_train.reshape(train_rows, -1)

# Oversample the minority class
X_train, y_train = sm.fit_resample(X_train, y_train)

# Reshape back to RGB images: (>80, 30000) -> (>80, 100, 100, 3)
X_train = X_train.reshape(-1, 100, 100, 3)
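What SMOTE computes on those flattened 30,000-dimensional rows is simple interpolation: each synthetic sample lies on the segment between a minority sample and one of its nearest minority neighbors. A hand-rolled sketch with hypothetical vectors standing in for two flattened images:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(30000)          # one flattened minority image (100*100*3)
neighbor = rng.random(30000)   # one of its nearest minority neighbors (hypothetical)
lam = rng.random()             # interpolation factor in [0, 1)

# SMOTE's core formula: x_new = x + lam * (neighbor - x)
synthetic = x + lam * (neighbor - x)

# Every value of the synthetic sample stays between its two parents,
# so it reshapes back to a valid (100, 100, 3) image like any real row.
print(synthetic.reshape(100, 100, 3).shape)  # (100, 100, 3)
```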

Hemanth Kollipara