
I have a dataset of 2000 256 x 256 x 3 images to train a CNN model (with approximately 30 million trainable parameters) for pixel-wise binary classification. Before training it, I would like to split off validation and test sets. I have gone through all the answers to this question.

The suggestions there are along the lines of an 80-20 split, or random splits followed by training and observing the performance (trial and error). So my question is: is there a way/technique to choose a minimal dataset for validation and testing that still represents all the variance of the total dataset? My intuition is that there should be a quantity (like the mean) that can be measured per image and plotted, so that some values show up as outliers and some do not, and I can then take images from these groups so that this variety is represented as fully as possible.
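For instance, my intuition would look something like this (a rough sketch using the per-image mean intensity as the candidate quantity; the file name and the 2-sigma outlier rule are arbitrary choices on my part):

import numpy as np

# X: image dataset of shape (number of images, height, width, channels), values in [0, 255]
X = np.load('X_train.npy')  # hypothetical file name

per_image_mean = X.reshape(X.shape[0], -1).mean(axis=1)  # one scalar per image

# flag images whose mean intensity is far from the dataset average
mu, sigma = per_image_mean.mean(), per_image_mean.std()
outliers = np.where(np.abs(per_image_mean - mu) > 2 * sigma)[0]
print(len(outliers), 'candidate outlier images')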

I have a minimum-dataset constraint because I have very little data. Augmentation might be suggested, but should I then go for positional or intensity-based augmentations? From class notes, our teacher told us that the max-pooling layer makes the network invariant to translations/rotations, so I am assuming positional augmentations won't add much.
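If intensity-based augmentations are the way to go, I am imagining something as simple as this (a sketch in plain NumPy; the jitter ranges are placeholders, not tuned values):

import numpy as np

def intensity_augment(img, rng=None):
    # random brightness (additive) and contrast (multiplicative) jitter
    # for an image with pixel values scaled to [0, 1]
    if rng is None:
        rng = np.random.default_rng()
    brightness = rng.uniform(-0.1, 0.1)
    contrast = rng.uniform(0.9, 1.1)
    out = (img - img.mean()) * contrast + img.mean() + brightness
    return np.clip(out, 0.0, 1.0)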


1 Answer


Tried and rejected:

1. Feature detectors and descriptors - keypoint detectors are not a good summary of a whole image, and descriptors are long vectors. I discarded this because, at the time, I did not have a clear picture of the desired solution; it could be revisited.

2. Autoencoders - the idea was to train an autoencoder with a 3-dimensional bottleneck, plot the encoded values in 3-D space, look for clusters, and split the data from each cluster into train, validation, and test (roughly the sketch shown after this list). This did not work: the training loss stopped decreasing beyond a point, given the limited GPU memory on Google Colab.
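For reference, the autoencoder idea was roughly along these lines (a minimal dense-autoencoder sketch in Keras with a 3-D bottleneck; the layer sizes are assumptions, not the exact model I trained):

from tensorflow import keras

input_dim = 256 * 256 * 3  # length of a flattened image

inputs = keras.Input(shape=(input_dim,))
h = keras.layers.Dense(128, activation='relu')(inputs)
code = keras.layers.Dense(3, name='bottleneck')(h)            # 3-D code for plotting
h = keras.layers.Dense(128, activation='relu')(code)
outputs = keras.layers.Dense(input_dim, activation='sigmoid')(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, autoencoder.get_layer('bottleneck').output)
autoencoder.compile(optimizer='adam', loss='mse')

# autoencoder.fit(X_reshaped, X_reshaped, epochs=50, batch_size=16)
# codes = encoder.predict(X_reshaped)  # (number of images, 3) points to plot and cluster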

What seems to be working:

Dimensionality reduction techniques: the idea is to flatten each image, fit the reduction model, and transform the images into a lower dimension so they can be plotted and organized. I found that t-SNE (a neighbor-graph method) works better than PCA (a matrix-factorization method). Hence, I chose UMAP (also a neighbor-graph technique).

MWE:

import numpy as np
import umap

# path: directory containing X_train.npy
X = np.load(path + '/X_train.npy')  # image dataset, shape = (number of images, height, width, channels)

# flatten each h x w x c image into a vector of length h*w*c (copy, so X can be freed)
X_reshaped = X.reshape(X.shape[0], -1).astype(np.float64)
del X
X_reshaped = X_reshaped / 255  # scale pixel values to [0, 1]
X_reshaped.shape

# fit UMAP and project the images down to 3 dimensions
reducer = umap.UMAP(n_components=3)
reducer.fit(X_reshaped)

embedding = reducer.transform(X_reshaped)
embedding.shape  # (number of images, 3)

# cluster the 3-D embedding with k-means
from sklearn import cluster
kmeans = cluster.KMeans(n_clusters=4, random_state=42).fit(embedding)

# interactive 3-D scatter plot, coloured by cluster label
import plotly

data = [plotly.graph_objs.Scatter3d(x=embedding[:, 0],
                                    y=embedding[:, 1],
                                    z=embedding[:, 2],
                                    mode='markers',
                                    marker=dict(color=kmeans.labels_))]
plotly.offline.iplot(data)

Resulting plot: [3-D scatter of the UMAP embedding, with points coloured by k-means cluster label]

From this plot, outliers can be observed and the split can be done manually from the clusters (see the sketch below), although how large each cluster's share should be is a different question. :)
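As an alternative to a fully manual split, the k-means labels can be used to stratify the split so that every cluster is represented proportionally in train/validation/test (a sketch; the split fractions here are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(embedding.shape[0])

# carve out the test set first, stratified by cluster label
train_val_idx, test_idx = train_test_split(
    indices, test_size=0.1, stratify=kmeans.labels_, random_state=42)

# then split the remainder into train and validation, again stratified
train_idx, val_idx = train_test_split(
    train_val_idx, test_size=0.1, stratify=kmeans.labels_[train_val_idx], random_state=42)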
