I'm using MobileNet and TensorFlow 2 to distinguish between 4 fairly similar toys. I have exactly 750 images for each toy, plus a fifth label with 750 'negative' images that contain none of the toys.
I've used MobileNet before for this with a fair degree of success, but something about this case is causing a lot of overfitting (~30-40% discrepancy between training/validation accuracy). The model very quickly trains to a training accuracy of about 99.8% in 3 epochs but the validation accuracy is stuck around 75%. The validation data set is a random set of 20% of the input images. When looking at the accuracy of the model, there's a strong bias towards one of the toys with a lot of the other toys falsely identified as that toy.
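To put numbers on that bias, a confusion matrix over the validation set is one option. This is a minimal sketch in plain NumPy, assuming integer class ids where 0 to 3 are the toys and 4 is the 'negative' label. The `y_true` and `y_pred` arrays are made up for illustration and are not my real results.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (true, pred) pairs repeat.
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# Hypothetical labels: 0..3 are the four toys, 4 is the 'negative' class.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([0, 0, 0, 1, 0, 2, 0, 3, 4, 4])

cm = confusion_matrix(y_true, y_pred, num_classes=5)
predicted_counts = cm.sum(axis=0)                  # how often each class was predicted
per_class_recall = cm.diagonal() / cm.sum(axis=1)  # fraction of each true class found

print(cm)
print("predictions per class:", predicted_counts)
print("recall per class:", per_class_recall)
```

A column whose total is much larger than its row total points at the class the model is over-predicting.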
I've tried pretty much everything out there to combat this:
I've added Dropout after the Conv2D layer that's added to the top of MobileNet and tried various dropout rates between 0.2 and 0.9.
    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(label_count, activation='softmax')
    ])
I've added an additional Dropout layer before the Conv2D layer, which seemed to marginally improve things:
    model = tf.keras.Sequential([
        base_model,
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Conv2D(32, 3, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(label_count, activation='softmax')
    ])
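For completeness, here is a self-contained sketch of how a model like this can be assembled end to end. Several things in it are assumptions rather than details from my setup: the `build_model` helper name, the 224x224x3 input shape, the use of `tf.keras.applications.MobileNet` (v1) with `include_top=False`, and the ImageNet weights.

```python
import tensorflow as tf

def build_model(label_count, input_shape=(224, 224, 3), weights="imagenet",
                dropout_rate=0.5):
    """MobileNet backbone followed by the small Dropout/Conv2D head."""
    # Assumption: MobileNet v1 without its ImageNet classification head.
    base_model = tf.keras.applications.MobileNet(
        input_shape=input_shape,
        include_top=False,
        weights=weights,  # pass None to skip downloading pretrained weights
    )
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        base_model,
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(label_count, activation="softmax"),
    ])
```

With the four toys plus the 'negative' label this would be called as `build_model(label_count=5)`.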
I've also added more test data, trying mixtures of photographs of the toys in various lighting conditions and backgrounds, as well as generated images of the toys superimposed over random backgrounds. None of these had a significant impact.
Should I be adding dropout to the MobileNet model itself rather than just to the layers that I'm adding after it? I came across this code on GitHub that does this, but I've no idea whether it's actually a good idea, or quite how to achieve it with TensorFlow 2.
Is this sensible, or feasible?
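One thing that can be checked directly is the `dropout` argument of `tf.keras.applications.MobileNet`. As far as I can tell it only controls a Dropout layer inside the built-in classification head, so it should have no effect once `include_top=False` is used. The sketch below counts Dropout layers in both variants. The `count_dropout_layers` helper, the input shape, and `classes=5` are assumptions for illustration, and `weights=None` is used only to avoid downloading pretrained weights.

```python
import tensorflow as tf

def count_dropout_layers(model):
    """Count the top-level Dropout layers in a Keras model."""
    return sum(isinstance(layer, tf.keras.layers.Dropout) for layer in model.layers)

# MobileNet v1 with its own classification head.
with_top = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=True, weights=None,
    classes=5, dropout=0.5,
)

# The same backbone without the head, as used for transfer learning.
without_top = tf.keras.applications.MobileNet(
    input_shape=(224, 224, 3), include_top=False, weights=None,
    dropout=0.5,
)

print("Dropout layers with head:", count_dropout_layers(with_top))
print("Dropout layers without head:", count_dropout_layers(without_top))
```

If the second count is zero, getting dropout between the convolutional blocks would mean rebuilding the backbone layer by layer rather than passing a flag.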
Alternatively, the only other ideas I can think of are:
- Capture more images, to make the training harder, though I'd have thought 750 for each item should be enough to do a pretty good job.
- Don't use MobileNet; create a neural network from scratch or use another pre-existing one.