How to split a tensorflow dataset into train, test and validation in a Python script?

Question

In a Jupyter notebook with TensorFlow 2.0.0, an 80-10-10 train-validation-test split was performed as follows:

import tensorflow_datasets as tfds
from os import getcwd
splits = tfds.Split.ALL.subsplit(weighted=(80, 10, 10))

filePath = f"{getcwd()}/../tmp2/"
splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=splits, data_dir=filePath)

However, when I run the same code locally, I get this error:

AttributeError: type object 'Split' has no attribute 'ALL'

I have seen I can create two sets in this way:

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=['train[:80]','test[80:90]'], data_dir=filePath)

but I do not know how I can add a third set.

roschach · Answer 1 · 2021-10-15T08:25:31.400

tfds.Split.ALL.subsplit and tfds.Split.TRAIN.subsplit appear to be deprecated and are no longer supported.

Some datasets are already split into train and test. For that case I found the following solution (using the Fashion MNIST dataset as an example):

splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True, split=['train+test[:80]','train+test[80:90]', 'train+test[90:]'], data_dir=filePath)
(train_examples, validation_examples, test_examples) = splits

EDIT AFTER COMMENTS

The previous code had some errors. First of all, the official documentation says:

Full dataset ('all'): 'all' is a special split name corresponding to the union of all splits (equivalent to 'train+test+...')

but when I tried it, it did not work. 'all' would be helpful, but there is an alternative. The errors in the previous code are that the % sign is required and that the slice must be specified for each split separately. I modified the code as follows:

import tensorflow_datasets as tfds
splits, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True,
                         split=['train[:80%]+test[:80%]', 'train[80%:90%]+test[80%:90%]', 'train[90%:]+test[90%:]'],
                         data_dir='./')
#(train_examples, validation_examples, test_examples) = splits

for el in splits:
    print(el.cardinality())

which prints:

tf.Tensor(56000, shape=(), dtype=int64)
tf.Tensor(7000, shape=(), dtype=int64)
tf.Tensor(7000, shape=(), dtype=int64)
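
These sizes follow from simple arithmetic, assuming Fashion MNIST ships 60,000 training and 10,000 test examples. The following is a minimal sketch and not part of the original answer: the helper names percent_split_specs and expected_sizes are hypothetical, and the floor-division rounding is an assumption that only matters when a split size is not evenly divisible. It builds the same split strings as above for any list of split names and estimates how many examples each subset should contain.

```python
def percent_split_specs(split_names, percentages):
    """Build one TFDS split string per subset, slicing every named split by percent."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    specs = []
    start = 0
    for pct in percentages:
        end = start + pct
        # Leave the boundary empty at 0% and 100%, as in 'train[:80%]' and 'train[90%:]'.
        lo = "" if start == 0 else f"{start}%"
        hi = "" if end == 100 else f"{end}%"
        specs.append("+".join(f"{name}[{lo}:{hi}]" for name in split_names))
        start = end
    return specs


def expected_sizes(split_counts, percentages):
    """Estimate subset sizes when every split is sliced by the same percent ranges."""
    sizes = []
    start = 0
    for pct in percentages:
        end = start + pct
        # Assumption: boundaries are floored; exact when counts divide evenly.
        sizes.append(sum(n * end // 100 - n * start // 100 for n in split_counts.values()))
        start = end
    return sizes


specs = percent_split_specs(["train", "test"], (80, 10, 10))
sizes = expected_sizes({"train": 60000, "test": 10000}, (80, 10, 10))
print(specs)
print(sizes)
```

The resulting list can be passed as the split argument of tfds.load; the estimated sizes are 56000, 7000 and 7000, matching the cardinalities reported by the datasets.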

This doesn't really work. 'train+test[:80]', for example, takes 100% of train and 80% of test; it doesn't take 80% of the combined train + test. – Dinuz, Oct 14 '21 at 21:32
Moreover, if you don't add %, you are not using a percentage split (so in your case you are taking the whole train split plus the first 80 examples of test as training examples). – Dinuz, Oct 14 '21 at 21:40
I removed the downvote, and I upvoted you :) Thank you for addressing the issue. I can confirm that the 'all' special split name doesn't work (I checked the github code, and they revoked the push request). So for now I guess the 'all' flag is only present in the documentation but not in the code itself. – Dinuz, Oct 15 '21 at 18:20
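
The difference described in these comments can be illustrated without downloading anything. This sketch stands in for the two Fashion MNIST splits with plain ranges (60,000 and 10,000 items, an assumption about the dataset sizes) and counts what each specification would select. It only emulates the slicing semantics described above and does not call TFDS.

```python
# Assumed split sizes for Fashion MNIST; plain ranges stand in for the real data.
TRAIN_SIZE = 60_000
TEST_SIZE = 10_000
train = range(TRAIN_SIZE)
test = range(TEST_SIZE)

# 'train+test[:80]': the slice binds to 'test' only and, without '%',
# 80 is an absolute index, so this is all of train plus 80 test examples.
without_percent = len(train) + len(test[:80])

# 'train[:80%]+test[:80%]': each split is sliced separately by percentage.
with_percent = (len(train[: TRAIN_SIZE * 80 // 100])
                + len(test[: TEST_SIZE * 80 // 100]))

print(without_percent)  # 60080
print(with_percent)     # 56000
```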

score -1 · Answer 2 · answered May 24 '21 at 19:57

For the rock_paper_scissors dataset on TFDS, this works for me:

splits = ['train+test[:80]', 'train+test[80:90]', 'train+test[90:]']

splits, info = tfds.load( 'rock_paper_scissors', split=splits, as_supervised=True, with_info=True)

(train_examples, validation_examples, test_examples) = splits

num_examples = info.splits['train'].num_examples
num_classes = info.features['label'].num_classes
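
Since info.splits[name].num_examples gives the size of each split, the boundaries can also be computed as absolute indices, which avoids any ambiguity about percentages. This is a rough sketch under stated assumptions: absolute_split_specs is a hypothetical helper, and the counts below are made-up placeholders that would in practice be read from info.splits. Note that every slice is written per split, because a slice applies only to the split name it follows.

```python
def absolute_split_specs(split_counts, percentages):
    """Build TFDS split strings with absolute index ranges for every split."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    specs = []
    start = 0
    for pct in percentages:
        end = start + pct
        parts = []
        for name, count in split_counts.items():
            lo = count * start // 100  # first index of this subset
            hi = count * end // 100    # one past the last index
            parts.append(f"{name}[{lo}:{hi}]")
        specs.append("+".join(parts))
        start = end
    return specs


# Placeholder counts for illustration only; in a real script use e.g.
# {name: info.splits[name].num_examples for name in ("train", "test")}.
counts = {"train": 1000, "test": 200}
abs_specs = absolute_split_specs(counts, (80, 10, 10))
print(abs_specs)
```

The three strings can then be passed as the split argument of tfds.load in place of the hand-written list.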

Mandi

@Francesco Boi please check this answer. – pullidea-dev May 24 '21 at 23:19
@Nikita I can't see what additional information this answer has compared to my older one. – roschach May 31 '21 at 04:24
