
I am new to TensorFlow, and I have started using TensorFlow 2.0.

I have built a tensorflow dataset for a multi-class classification problem; let's call it `labeled_ds`. I prepared this dataset by loading all the image files from their respective class-wise directories, following along the tutorial here: tensorflow guide to load image dataset
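
For reference, the loading step from that guide looks roughly like this; the root directory, file pattern, and image size below are illustrative, not my exact code:

import pathlib
import tensorflow as tf

data_dir = pathlib.Path('path/to/images')  # one subdirectory per class (illustrative path)
class_names = sorted(item.name for item in data_dir.glob('*') if item.is_dir())

list_ds = tf.data.Dataset.list_files(str(data_dir / '*/*'))

def process_path(file_path):
    # The label comes from the name of the directory containing the file.
    parts = tf.strings.split(file_path, '/')
    label = tf.argmax(tf.cast(parts[-2] == class_names, tf.int32))
    image = tf.io.decode_jpeg(tf.io.read_file(file_path), channels=3)
    image = tf.image.resize(image, [224, 224])  # illustrative size
    return image, label

labeled_ds = list_ds.map(process_path)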

Now, I need to split `labeled_ds` into three disjoint pieces: train, validation, and test. I went through the TensorFlow API, but found no example that lets you specify the split percentages. I found something in the `load` method, but I am not sure how to use it. Further, how can I get the splits to be stratified?

# labeled_ds contains multi-class data, which is unbalanced.
train_ds, val_ds, test_ds = tf.data.Dataset.tfds.load(labeled_ds, split=["train", "validation", "test"])

I am stuck here and would appreciate any advice on how to proceed. Thanks in advance.

Swaroop
  • Refer to [this](https://stackoverflow.com/a/51126863/11652623) answer to split a `tf.data` dataset – Swapnil Masurekar Dec 01 '19 at 06:10
  • 1
  • @SWAPNILMASUREKAR the solution provided [there](https://stackoverflow.com/questions/51125266/how-do-i-split-tensorflow-datasets/51126863#51126863) will work for splitting the data into multiple subsets. The problem is that the resulting splits will still not be **stratified**. – Swaroop Feb 06 '20 at 12:19
  • 1
  • I came across the same problem, and didn't seem to find a solution in tensorflow that makes sure the dataset is in fact stratified. The solution I ended up using is [this](https://github.com/keras-team/keras/issues/5862#issuecomment-408529762). It's a function that splits your dataset into subdirectories of train and validation; then you can create train and validation tensorflow datasets from each directory (a sketch of this idea follows these comments). – ofir dubi Sep 29 '20 at 10:24
  • @ofirdubi thanks for sharing the link to the code. I too did something similar since TensorFlow does not provide such a functionality out of the box. – Swaroop Sep 29 '20 at 12:19
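
A rough sketch of the idea from the last two comments: do the stratified split on the file paths before building the `tf.data` pipelines, for example with scikit-learn's `train_test_split` and its `stratify` argument (the paths, split ratios, and decode step below are illustrative assumptions):

import pathlib
import tensorflow as tf
from sklearn.model_selection import train_test_split  # assumes scikit-learn is available

data_dir = pathlib.Path('path/to/images')  # hypothetical root with one subdirectory per class
all_paths = [str(p) for p in data_dir.glob('*/*.jpg')]
class_names = sorted({pathlib.Path(p).parent.name for p in all_paths})
labels = [class_names.index(pathlib.Path(p).parent.name) for p in all_paths]

# Stratified 80/10/10 split of the file paths (two passes of train_test_split).
train_p, rest_p, train_y, rest_y = train_test_split(
    all_paths, labels, test_size=0.2, stratify=labels, random_state=42)
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=42)

def load_image(path, label):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [224, 224])  # illustrative size
    return image, label

def make_ds(paths, class_ids):
    return tf.data.Dataset.from_tensor_slices((paths, class_ids)).map(load_image)

train_ds = make_ds(train_p, train_y)
val_ds = make_ds(val_p, val_y)
test_ds = make_ds(test_p, test_y)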

4 Answers


Please refer to the code below to create train, test, and validation splits using the TensorFlow dataset "oxford_flowers102":

!pip install tensorflow==2.0.0

import tensorflow as tf
print(tf.__version__)
import tensorflow_datasets as tfds

labeled_ds, summary = tfds.load('oxford_flowers102', split='train+test+validation', with_info=True)

# Count the elements by iterating once (a tf.data dataset has no len() here).
labeled_all_length = [i for i,_ in enumerate(labeled_ds)][-1] + 1

# 80% train, 10% validation, 10% test.
train_size = int(0.8 * labeled_all_length)
val_test_size = int(0.1 * labeled_all_length)

# Carve out the three disjoint subsets with take/skip.
df_train = labeled_ds.take(train_size)
df_test = labeled_ds.skip(train_size)
df_val = df_test.skip(val_test_size)
df_test = df_test.take(val_test_size)

df_train_length = [i for i,_ in enumerate(df_train)][-1] + 1
df_val_length = [i for i,_ in enumerate(df_val)][-1] + 1
df_test_length = [i for i,_ in enumerate(df_test)][-1] + 1

print('Original: ', labeled_all_length)
print('Train: ', df_train_length)
print('Validation :', df_val_length)
print('Test :', df_test_length)
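
As a side note, the split sizes are also exposed by the `with_info` metadata, so the enumerate-based counting above can usually be avoided; a minimal sketch using the same `summary` object:

# Total number of examples across the three official splits, from the dataset metadata.
labeled_all_length = sum(summary.splits[s].num_examples
                         for s in ('train', 'test', 'validation'))
print('Original:', labeled_all_length)
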
bsquare
  • The solution looks good, but this method of choosing training, test, and validation subsets does not ensure that the data is *stratified*. The term `stratified` means that all three subsets have the same class proportions as the full dataset. – Swaroop Mar 19 '20 at 14:35

I had the same problem.

It depends on the dataset; most of them have a train and a test set. In that case you can do the following (assuming an 80-10-10 split):

# Take 80%/10%/10% of both the official train and test splits.
splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
                         split=['train[:80%]+test[:80%]',
                                'train[80%:90%]+test[80%:90%]',
                                'train[90%:]+test[90%:]'],
                         data_dir=filePath)
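
To unpack the result and sanity-check the proportions, a small sketch (counting this way iterates each subset once):

train_ds, val_ds, test_ds = splits

# Iterate once over each subset to confirm the rough 80-10-10 ratio.
for name, ds in [('train', train_ds), ('validation', val_ds), ('test', test_ds)]:
    print(name, sum(1 for _ in ds))
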
roschach
  • Thanks, Francesco, I was looking for a solution on a custom dataset. However, your solution will help others using TensorFlow-provided datasets. – Swaroop May 16 '21 at 07:33

Importing TensorFlow Datasets (and TensorFlow itself, which is needed below for `tf.cast`):

import tensorflow as tf
import tensorflow_datasets as tfds

`MNIST_info` stores the dataset metadata when the MNIST dataset gets loaded:

MNIST_dataset, MNIST_info = tfds.load(name='mnist', with_info=True, as_supervised=True)

Splitting the MNIST dataset into its two predefined parts, the train and test datasets:

MNIST_train, MNIST_test = MNIST_dataset['train'], MNIST_dataset['test']

num_validation_samples=0.1*MNIST_info.splits['train'].num_examples
# (allocating 10 percent of the training dataset to create the validation dataset.)

Once the validation sample count is computed, we cast it to an integer:

num_validation_samples = tf.cast(num_validation_samples, tf.int64)

Similarly, we compute the test and train sample counts:

num_test_samples = MNIST_info.splits['test'].num_examples    
num_test_samples = tf.cast(num_test_samples, tf.int64)    
num_train_samples = 0.8*MNIST_info.splits['train'].num_examples

(allocating 80 percent of the original training dataset to create the training set.)

num_train_samples = tf.cast(num_train_samples, tf.int64)
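
These counts are then typically used to carve the validation set out of a shuffled training split with `take`/`skip`; a minimal sketch, assuming the variables defined above:

BUFFER_SIZE = 10000  # illustrative shuffle buffer size

# Shuffle the official training split, then carve out the validation set.
shuffled_train = MNIST_train.shuffle(BUFFER_SIZE)
validation_data = shuffled_train.take(num_validation_samples)
train_data = shuffled_train.skip(num_validation_samples).take(num_train_samples)
test_data = MNIST_test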

Hope this has answered your question.

Yash Mehta

Francesco Boi's solution works well for me.

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
                         split=['train[:80%]+test[:80%]',
                                'train[80%:90%]+test[80%:90%]',
                                'train[90%:]+test[90%:]'])

(train_examples, validation_examples, test_examples) = splits