I'm trying to do a binary image classification problem, but the two classes (~590 and ~5900 instances, for class 1 and 2, respectively) are heavily skewed, but still quite distinct.
Is there any way I can fix this, I want to try SMOTE/random weighted oversampling.
I've tried a lot of different things but I'm stuck. I've tried using class_weights=[10,1]
,[5900,590]
, and [1/5900,1/590]
and my model still only predicts class 2.
I've tried using tf.data.experimental.sample_from_datasets
but I couldn't get it to work. I've even tried using sigmoid focal cross-entropy loss, which helped a lot but not enough.
I want to be able to oversample class 1 by a factor of 10, the only thing I have tried that has kinda worked is manually oversampling i.e. copying the train dir's class 1 instances to match the number of instances in class 2.
Is there not an easier way of doing this, I'm using Google Colab and so doing this is extremely inefficient.
Is there a way to specify SMOTE params / oversampling within the data generator or similar?
data/
...class_1/
........image_1.jpg
........image_2.jpg
...class_2/
........image_1.jpg
........image_2.jpg
My data is in the form shown above.
TRAIN_DATAGEN = ImageDataGenerator(rescale = 1./255.,
rotation_range = 40,
width_shift_range = 0.2,
height_shift_range = 0.2,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip = True)
TEST_DATAGEN = ImageDataGenerator(rescale = 1.0/255.)
TRAIN_GENERATOR = TRAIN_DATAGEN.flow_from_directory(directory = TRAIN_DIR,
batch_size = BACTH_SIZE,
class_mode = 'binary',
target_size = (IMG_HEIGHT, IMG_WIDTH),
subset = 'training',
seed = DATA_GENERATOR_SEED)
VALIDATION_GENERATOR = TEST_DATAGEN.flow_from_directory(directory = VALIDATION_DIR,
batch_size = BACTH_SIZE,
class_mode = 'binary',
target_size = (IMG_HEIGHT, IMG_WIDTH),
subset = 'validation',
seed = DATA_GENERATOR_SEED)
...
...
...
HISTORY = MODEL.fit(TRAIN_GENERATOR,
validation_data = VALIDATION_GENERATOR,
epochs = EPOCHS,
verbose = 2,
callbacks = [EARLY_STOPPING],
class_weight = CLASS_WEIGHT)
I'm relatively new to Tensorflow but I have some experience with ML as a whole. I've been tempted to switch to PyTorch several times as they have params for data loaders that automatically (over/under)sample with sampler=WeightedRandomSampler
.
Note: I've looked at many tutorials about how to oversample however none of them are image classification problems, I want to stick with TF/Keras as it allows for easy transfer learning, could you guys help out?