
I have the following plot:

[Plot: training and validation accuracy/loss per epoch, as produced by plot(history)]

The model is created with the following number of samples:

                class1     class2
train             20         20
validate          21         13

As I understand it, the plot shows there is no overfitting. But since the sample size is very small, I'm not confident that the model generalizes well.

Is there any other way to measure overfitting besides the above plot?

This is my complete code:

library(keras)
library(tidyverse)


train_dir <- "data/train/"
validation_dir <- "data/validate/"



# Making model ------------------------------------------------------------


conv_base <- application_vgg16(
  weights = "imagenet",
  include_top = FALSE,
  input_shape = c(150, 150, 3)
)

# VGG16 based model -------------------------------------------------------

# Works better with regularizer
model <- keras_model_sequential() %>%
  conv_base() %>%
  layer_flatten() %>%
  layer_dense(units = 256, activation = "relu", kernel_regularizer = regularizer_l1(l = 0.01)) %>%
  layer_dense(units = 1, activation = "sigmoid")

summary(model)

length(model$trainable_weights)
freeze_weights(conv_base)
length(model$trainable_weights)


# Train model -------------------------------------------------------------
desired_batch_size <- 20 

train_datagen <- image_data_generator(
  rescale = 1 / 255,
  rotation_range = 40,
  width_shift_range = 0.2,
  height_shift_range = 0.2,
  shear_range = 0.2,
  zoom_range = 0.2,
  horizontal_flip = TRUE,
  fill_mode = "nearest"
)

# Note that the validation data shouldn't be augmented!
test_datagen <- image_data_generator(rescale = 1 / 255)


train_generator <- flow_images_from_directory(
  train_dir, # Target directory
  train_datagen, # Data generator
  target_size = c(150, 150), # Resizes all images to 150 × 150
  shuffle = TRUE,
  seed = 1,
  batch_size = desired_batch_size, # was 20
  class_mode = "binary" # binary_crossentropy loss for binary labels
)

validation_generator <- flow_images_from_directory(
  validation_dir,
  test_datagen,
  target_size = c(150, 150),
  shuffle = TRUE,
  seed = 1,
  batch_size = desired_batch_size,
  class_mode = "binary"
)

# Fine tuning -------------------------------------------------------------

unfreeze_weights(conv_base, from = "block3_conv1")

# Compile model -----------------------------------------------------------



model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_rmsprop(lr = 2e-5),
  metrics = c("accuracy")
)


# Evaluate by epochs ---------------------------------------------------------------


# Fit the model and plot accuracy/loss across epochs (slow)
history <- model %>% fit_generator(
  train_generator,
  steps_per_epoch = 100,
  epochs = 15, # was 50
  validation_data = validation_generator,
  validation_steps = 50
)

plot(history)
– pdubois
  • Are you using the keras defaults from the RStudio tutorial? https://tensorflow.rstudio.com/blog/keras-image-classification-on-small-datasets.html. It looks like it, but just checking. It would be great if you could provide the sample data. My approach would be to train on a larger public dataset and then analyse the small private data within that context. You could also try swapping the training and the validation data and comparing the results. – Technophobe01 Feb 13 '18 at 02:24

4 Answers


So two things here:

  1. Stratify your data w.r.t. classes - your validation data has a completely different class distribution than your training set (the training set is balanced whereas the validation set is not). This may affect your loss and metric values. It is better to stratify your split so that the class ratio is the same in both sets.

  2. With so few data points, use more robust validation schemes - as you can see, you have only 74 images in total. In this case it is not a problem to load all the images into an in-memory array (you can still apply data augmentation with the flow function) and to use validation schemes that are hard to set up when your data sits in folders. The schemes (from sklearn) which I advise you to use are listed below; a rough R sketch follows after the list:

    • stratified k-fold cross-validation - divide your data into k chunks and, for each choice of k - 1 chunks, train your model on those k - 1 chunks and compute metrics on the one left out for validation. The final result is the mean of the results obtained on the validation chunks. You can, of course, check not only the mean but also other statistics of the loss distribution (e.g. min, max, median). You can also compare them with the results obtained on the training set for each fold.
    • leave-one-out - a special case of the previous scheme in which the number of chunks / folds equals the number of examples in your dataset. It is considered the most thorough way of measuring model performance, but it is rarely used in deep learning because training is usually too slow and datasets are too big to complete the computations in a reasonable time.
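
The answer points to sklearn, but since the question uses the R interface to keras, here is a rough base-R sketch of stratified k-fold cross-validation under some assumptions: the images have already been loaded into an array x of shape (N, 150, 150, 3) with binary labels y, and build_model() is a hypothetical helper that recreates and compiles the VGG16-based model from the question so that every fold starts from fresh weights.

library(keras)

k <- 5
set.seed(1)

# Stratified fold assignment: sample fold labels within each class separately
# so that the class ratio is roughly the same in every fold
folds <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}

fold_acc <- numeric(k)
for (i in 1:k) {
  val_idx   <- which(folds == i)
  train_idx <- which(folds != i)

  model <- build_model()  # hypothetical helper: rebuild + compile the model

  model %>% fit(
    x[train_idx, , , , drop = FALSE], y[train_idx],
    epochs = 15, batch_size = 20, verbose = 0
  )

  scores <- model %>% evaluate(
    x[val_idx, , , , drop = FALSE], y[val_idx], verbose = 0
  )
  fold_acc[i] <- scores[["acc"]]  # metric may be named "accuracy" in newer keras versions
}

mean(fold_acc)  # overall estimate
sd(fold_acc)    # spread across folds shows how stable the estimate is

With this little data, the minimum and maximum fold accuracies are as informative as the mean, as the answer suggests.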
– Marcin Możejko
  • Another approach to consider, with very few data points for training, would be an extension of k-fold cross-validation: "iterated k-fold cross-validation with shuffling": 1) shuffle the data, 2) perform a k-fold cross-validation. Repeat steps 1 and 2 N times; then, to compute the final score, take the average of the scores obtained in each run of k-fold cross-validation. This score is much more precise in this case (i.e. a low number of data points) than performing only one k-fold run. However, it can be more computationally expensive since K×N models are trained. – today Feb 17 '18 at 15:40
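
A minimal sketch of that idea, assuming run_stratified_kfold(x, y, k) is a hypothetical wrapper around the fold loop from the previous sketch that performs one shuffled, stratified k-fold run and returns its mean validation accuracy:

n_iter <- 5  # N repetitions of the whole k-fold procedure
iter_scores <- sapply(1:n_iter, function(i) {
  set.seed(i)                        # different shuffle / fold assignment each time
  run_stratified_kfold(x, y, k = 5)  # hypothetical wrapper; trains k models per call
})
mean(iter_scores)  # final score; note this trains K x N models in total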

I recommend looking at the predictions as the next step.

For example, judging from the top plot and the number of samples provided, your validation accuracy fluctuates between two values, and the difference between those values corresponds to exactly one sample guessed right.

So your model predicts more or less the same results (plus or minus one observation) regardless of the fitting. This is a bad sign.

Also, the number of features and trainable parameters (weights) is far too high for the number of samples provided. All those weights simply have no chance of actually being trained.
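
As a sketch of that first step, assuming the fitted model, validation_dir and test_datagen from the question are still in scope, the per-image validation predictions can be pulled out with an unshuffled generator (34 = 21 + 13 validation images):

# Rebuild the validation generator without shuffling and with batch_size = 1
# so that predictions line up with filenames
validation_generator_fixed <- flow_images_from_directory(
  validation_dir,
  test_datagen,
  target_size = c(150, 150),
  shuffle = FALSE,
  batch_size = 1,
  class_mode = "binary"
)

probs <- model %>% predict_generator(validation_generator_fixed, steps = 34)

# One row per validation image: file name, true class, predicted probability
data.frame(
  file  = validation_generator_fixed$filenames,
  label = validation_generator_fixed$classes,
  prob  = as.vector(probs)
)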

– Denis Zubo

Your validation loss is consistently lower than your training loss, so I would be quite suspicious of these results. Looking at the validation accuracy, it simply shouldn't behave like that.

The less data you have, the less confidence you can have in anything, so you are right to be unsure about overfitting. The only thing that really works here is to gather more data, either through data augmentation or by combining with another dataset.

– Aleksandar Jovanovic

If you want to measure the overfitting of your current model, you could test it repeatedly, each time drawing 34 samples from the validation set with the sample function using replace = TRUE. By resampling your validation set with replacement you create more "extreme" datasets and hence get a better estimate of how much the predictions can vary given your available data. This resampling is the bootstrap; aggregating models trained on such resamples is called bagging (bootstrap aggregating).
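
A minimal sketch of that resampling, assuming the fitted model from the question and that the 34 validation images have already been loaded into memory as x_val (34 x 150 x 150 x 3, already rescaled to [0, 1]) with binary labels y_val:

set.seed(1)
n_boot <- 200
boot_acc <- numeric(n_boot)

for (b in 1:n_boot) {
  # Resample the validation set with replacement (the bootstrap)
  idx <- sample(seq_along(y_val), size = length(y_val), replace = TRUE)
  probs <- model %>% predict(x_val[idx, , , , drop = FALSE])
  boot_acc[b] <- mean(as.integer(probs > 0.5) == y_val[idx])
}

# The spread of the bootstrap accuracies shows how much the estimate can vary
quantile(boot_acc, c(0.025, 0.5, 0.975))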

– nadizan