1. TensorFlow Datasets
Since TF 2.0, the recommended approach is to use the tf.data API together with tf.keras. tf.data.Dataset, which is part of that API, lets you easily carry out various operations on your data, such as image augmentation (e.g. rotation or shifting), via map calls (you can find other possibilities in the documentation).
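As a rough illustration of augmentation through map, the sketch below applies a random horizontal flip to every image. The flip is only a simple stand-in for the rotation or shifting mentioned above, the augment function name is hypothetical, and the all-zero tensors are placeholder data rather than Fashion MNIST.

```python
import tensorflow as tf


def augment(image, label):
    """Hypothetical augmentation step applied to one (image, label) pair."""
    # Convert uint8 pixels in [0, 255] to float32 values in [0, 1].
    image = tf.image.convert_image_dtype(image, tf.float32)
    # Flip the image horizontally with probability 0.5.
    image = tf.image.random_flip_left_right(image)
    return image, label


# Placeholder data shaped like eight 28x28 grayscale images (assumption).
images = tf.zeros([8, 28, 28, 1], dtype=tf.uint8)
labels = tf.zeros([8], dtype=tf.int64)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .map(augment)  # runs augment on every element
    .batch(4)
)
```

The same map call can be chained onto a dataset returned by tfds.load with as_supervised=True, since that also yields (image, label) pairs.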
TensorFlow Datasets is part of TensorFlow's ecosystem. It makes downloading data easier (many ready-made datasets are available, including Fashion MNIST; see here for available options) and returns the data already in tf.data.Dataset form.
This snippet:
import tensorflow_datasets as tfds
# Request both default splits explicitly; without `split`, tfds.load returns a dict of splits.
train, test = tfds.load("fashion_mnist", split=["train", "test"], as_supervised=True)
downloads the data and returns the default train and test splits (same as the Keras equivalent, except for the data type).
You can create your own dataset builders, though usually a call to tfds.load will be enough for standard operations.
Custom splits
Now, if you want a different split (not the default 60,000 train and 10,000 test examples), you can define it using a tfds.Split object. Most of the provided datasets (Fashion MNIST included) come with the default tfds.Split.TRAIN and tfds.Split.TEST splits (some provide tfds.Split.VALIDATION as well).
Those default splits can be further divided into subparts in various ways:
Split one of TEST or TRAIN into N parts. The code below loads only 30,000 images from TRAIN and 2,500 images from TEST:
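import tensorflow_datasets as tfds
# Note: subsplit belongs to the legacy split API of early TFDS releases;
# later releases replaced it with split strings such as "train[:50%]".
train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(2)
test1, test2, test3, test4 = tfds.Split.TEST.subsplit(4)
# The registered dataset name uses an underscore: "fashion_mnist".
train_first_half = tfds.load("fashion_mnist", split=train_half_1)
test_second_quarter = tfds.load("fashion_mnist", split=test2)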
In a similar way you can take N percent of each split:
first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])
Or you could even combine splits to get more data, like this (you could then further split this data however you like):
train_and_test = tfds.Split.TRAIN + tfds.Split.TEST
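The subsplit and tfds.percent calls above come from the legacy TFDS split API. Later TFDS releases express the same ideas as split strings passed to tfds.load. The sketch below builds such strings; the helper names (even_split_specs, first_percent_spec, combine_specs, load_parts) are hypothetical and not part of TFDS, and the actual download only happens if load_parts is called.

```python
def even_split_specs(split_name, n_parts):
    """Return split strings cutting `split_name` into `n_parts` percent ranges."""
    if not 1 <= n_parts <= 100:
        raise ValueError("n_parts must be between 1 and 100")
    # Integer percent boundaries, e.g. [0, 25, 50, 75, 100] for n_parts=4.
    bounds = [100 * i // n_parts for i in range(n_parts + 1)]
    return [f"{split_name}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]


def first_percent_spec(split_name, percent):
    """Split string for the first `percent` percent of a split."""
    return f"{split_name}[:{percent}%]"


def combine_specs(*specs):
    """Join several split strings so they load as one dataset."""
    return "+".join(specs)


def load_parts(dataset_name="fashion_mnist"):
    """Load the first half of train and train+test combined (downloads data)."""
    import tensorflow_datasets as tfds  # imported lazily; only needed here

    first_half = tfds.load(
        dataset_name, split=even_split_specs("train", 2)[0], as_supervised=True
    )
    everything = tfds.load(
        dataset_name, split=combine_specs("train", "test"), as_supervised=True
    )
    return first_half, everything
```

For example, even_split_specs("test", 4)[1] gives "test[25%:50%]", the counterpart of test2 above, and first_percent_spec("train", 10) gives "train[:10%]".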
2. Keras
Keras loads the data as NumPy arrays. Although this is not the advised route and does not let you chain many operations with a simple map, you can split the arrays using standard Python slicing notation:
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
# First 10,000 elements from train
X_train_subset = X_train[:10000]
# Elements from 1000 to 5000 from test labels
y_test_subset = y_test[1000:5000]
# Elements from 8500 to the end of test data
X_test_subset = X_test[8500:]
On the other hand, it might be much more convenient to work with NumPy arrays instead of tf.data.Dataset for certain applications (especially non-standard ones), so the choice is yours.
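If you stay with arrays, the slicing above can be wrapped in a small helper, for instance to carve a validation set out of the training data. This is a minimal sketch: holdout_split is a hypothetical name, it slices without shuffling, and it works on any sliceable sequence, including the arrays returned by load_data().

```python
def holdout_split(features, labels, holdout_fraction=0.1):
    """Split off the last `holdout_fraction` of the data as a holdout set.

    No shuffling is performed; the tail of the arrays becomes the holdout.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")

    n_holdout = int(len(features) * holdout_fraction)
    cut = len(features) - n_holdout
    # Plain slicing, exactly as in the snippet above.
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])


# Possible usage with the Keras arrays loaded earlier:
# (X_train, y_train), (X_val, y_val) = holdout_split(X_train, y_train, 0.1)
```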