I have an image dataset with 12 classes. I'm comparing different upsampling techniques for image classification and wanted to upsample the massively imbalanced classes using SMOTE.
I defined my training set as shown below:
import tensorflow as tf

img_height = 256
img_width = 256
batch_size = 32

train_ds = tf.keras.utils.image_dataset_from_directory(
    image_directory,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

Output: "Found 11730 files belonging to 12 classes. Using 9384 files for training."
The problem, from my understanding, is that in order to upsample with SMOTE I need explicit x_train and y_train arrays. I attempted to build them with:
for images, labels in train_ds:  # meant to take just one batch of the dataset
    numpy_images = images.numpy()
    numpy_labels = labels.numpy()

numpy_images.shape
The shape returns (8, 256, 256, 3). This "8" feels out of place; I imagined the first dimension would be the length of my training set (9384), but maybe I'm misunderstanding.
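One guess: each iteration of the loop overwrites numpy_images, so maybe I'm only ever left with whatever batch came last (9384 = 293 × 32 + 8, which would explain the 8, but I'm not certain). If that's the case, I suspect I need to concatenate every batch instead. A sketch of what I mean, with my own guessed variable names:

import numpy as np

all_images = []
all_labels = []
for images, labels in train_ds:
    all_images.append(images.numpy())
    all_labels.append(labels.numpy())

x_all = np.concatenate(all_images)  # hoping for shape (9384, 256, 256, 3)
y_all = np.concatenate(all_labels)  # hoping for shape (9384,)

Is that the right way to get the full training set into memory?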
I reshaped the NumPy array to 2D so that it could be passed to SMOTE:
x_train = numpy_images.reshape(8, 256 * 256 * 3)
x_train.shape
The shape returns (8, 196608)
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42, k_neighbors=2)
X_smote, y_smote = sm.fit_resample(x_train, numpy_labels)
Finally, the above raises an error saying that n_samples = 1 while n_neighbors = 2.
My questions are:
- How do I resolve this issue?
- Why would the shape be (8, 256, 256, 3) instead of (9384, 256, 256, 3)?
Any help is appreciated!
SMOTE initialisation expects n_neighbors <= n_samples, but n_samples < n_neighbors
^ This thread was extremely helpful, but I still can't explain the shape of my NumPy array.
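To convince myself of that constraint, I tried a toy example (the data here is invented purely for illustration): SMOTE seems to need more than k_neighbors samples in every class, so a class with a single sample fails the same way my run did.

import numpy as np
from imblearn.over_sampling import SMOTE

# Class 1 has 3 samples, which is just enough for k_neighbors=2
# (each sample still has 2 same-class neighbours besides itself).
X_toy = np.random.rand(9, 4)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

X_res, y_res = SMOTE(k_neighbors=2, random_state=42).fit_resample(X_toy, y_toy)
print(X_res.shape, y_res.shape)  # class 1 upsampled to 6, so (12, 4) and (12,)

# With y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1]) the same call raises
# the same kind of n_neighbors <= n_samples error, since class 1 has 1 sample.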
My solution: I ended up just duplicating a set number of images for each of the underrepresented classes, which I did manually in File Explorer.
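A rough script equivalent of that manual duplication, in case it helps anyone; the class directory, file extension, and target count below are placeholders for my actual setup:

import shutil
from pathlib import Path

class_dir = Path("image_directory/some_rare_class")  # placeholder path
target_count = 500                                   # placeholder target

originals = sorted(class_dir.glob("*.jpg"))
for i in range(max(0, target_count - len(originals))):
    src = originals[i % len(originals)]              # cycle through existing images
    dst = class_dir / f"{src.stem}_copy{i}{src.suffix}"
    shutil.copy(src, dst)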