
The Fashion-MNIST dataset automatically returns 60,000 images for training and 10,000 images for evaluation. How do I change those numbers?

Here is my colab source code and the relevant part is:

fashion_mnist = keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
Szymon Maszke
ronm333
  • You can do it inside tensorflow data api pipeline, or simply concatenate numpy arrays and then split them in any proportion you need – Sharky Apr 28 '19 at 21:51

1 Answer


1. TensorFlow Datasets

Since TF 2.0, it is advised to use the tf.data API with tf.keras. tf.data.Dataset, which is part of that API, lets you easily carry out various operations on your data, such as image augmentation (e.g. rotation/shifting), via map calls (you can find other possibilities in the documentation).
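As a rough illustration of the map-based workflow, here is a minimal sketch. The arrays are zero-filled stand-ins with Fashion-MNIST's image shape (an assumption, so that nothing has to be downloaded), and preprocess is a name made up for this example. take and skip carve one dataset into two parts of whatever size you need.

```python
import numpy as np
import tensorflow as tf

# Stand-in data with Fashion-MNIST's image shape (assumption: in practice
# you would pass the arrays returned by fashion_mnist.load_data()).
images = np.zeros((100, 28, 28), dtype=np.uint8)
labels = (np.arange(100) % 10).astype(np.int64)


def preprocess(image, label):
    """Scale to [0, 1], add a channel axis, and randomly flip horizontally."""
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.expand_dims(image, axis=-1)          # (28, 28) -> (28, 28, 1)
    image = tf.image.random_flip_left_right(image)  # simple augmentation
    return image, label


dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Custom split: first 80 examples for training, the remaining 20 for validation.
train_ds = dataset.take(80).map(preprocess).batch(16)
val_ds = dataset.skip(80).map(preprocess).batch(16)
```

With the real data you would usually shuffle before calling take and skip, so that both parts contain a mix of all classes.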

TensorFlow Datasets is part of TensorFlow's ecosystem. It makes downloading data easier (various ready-made datasets are available, including Fashion MNIST, see here for available options) and returns the data already in tf.data.Dataset form.

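Using this snippet:

import tensorflow_datasets as tfds

train, test = tfds.load("fashion_mnist", split=["train", "test"], as_supervised=True)

will download the data and return the default train and test splits (the same split as the Keras equivalent, except for the data type). Note that without the split argument, tfds.load returns a dictionary keyed by split name, so unpacking it directly would yield the keys rather than the datasets.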

You can create your own dataset builders, though a call to tfds.load is usually enough for standard operations.

Custom splits

Now, if you want a different split (not the default 60,000 train and 10,000 test), you can define it using a tfds.Split object. Each of the provided datasets (Fashion MNIST included) comes with default tfds.Split.TRAIN and tfds.Split.TEST splits (some provide tfds.Split.VALIDATION as well).

Those default splits can be further divided into subparts in various ways:

Split one of TEST or TRAIN into N parts. The code below loads only 30,000 images from TRAIN (the first half) and 2,500 images from TEST (the second quarter):

import tensorflow_datasets as tfds

train_half_1, train_half_2 = tfds.Split.TRAIN.subsplit(2)
test1, test2, test3, test4 = tfds.Split.TEST.subsplit(4)

# The registered dataset name uses an underscore, not a hyphen
train_first_half = tfds.load("fashion_mnist", split=train_half_1)
test_second_quarter = tfds.load("fashion_mnist", split=test2)

In a similar way, you can take the first N percent of a split:

first_10_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:10])

Or you could even combine splits to get more data, like this (you could then further split this data however you like):

train_and_test = tfds.Split.TRAIN + tfds.Split.TEST
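Newer TFDS releases express the same splits as slicing strings passed to split=. If subsplit or tfds.percent is not available in your version, something along these lines should be roughly equivalent. This is a sketch: even_percent_specs and load_fashion_mnist are hypothetical helper names, not part of TFDS, and the helper only handles part counts that divide 100 evenly.

```python
def even_percent_specs(split_name, n_parts):
    """Return slicing strings that cut `split_name` into n_parts equal ranges."""
    if n_parts < 1 or 100 % n_parts != 0:
        raise ValueError("n_parts must divide 100 evenly")
    step = 100 // n_parts
    return [
        f"{split_name}[{start}%:{start + step}%]"
        for start in range(0, 100, step)
    ]


def load_fashion_mnist(split_specs):
    """Load the requested slices; a list of specs yields a list of datasets."""
    # Imported lazily so the helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load("fashion_mnist", split=split_specs, as_supervised=True)


train_halves = even_percent_specs("train", 2)   # ['train[0%:50%]', 'train[50%:100%]']
test_quarters = even_percent_specs("test", 4)   # four 25% slices of the test split
first_10_percent = "train[:10%]"                # first 10 percent of train
train_and_test = "train+test"                   # both default splits combined
```

For example, train_a, train_b = load_fashion_mnist(train_halves) would return the two halves of the training split as separate datasets.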

2. Keras

Keras loads the data as NumPy arrays. Although this is not the advised route and does not let you apply many operations with a simple map, you can split the arrays using standard Python slicing notation:

import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist

(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# First 10,000 elements from train
X_train_subset = X_train[:10000]

# Elements from 1000 to 5000 from test labels
y_test_subset = y_test[1000:5000]

# Elements from 8500 to the end of test data
X_test_subset = X_test[8500:]

On the other hand, it might be much more convenient to work with NumPy arrays instead of tf.data.Dataset for certain applications (especially non-standard ones), so the choice is yours.
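If you want entirely different proportions, one option is to pool the train and test arrays, shuffle them once, and cut wherever you like. A minimal sketch follows. resplit is a name invented for this example, and the small arrays are stand-ins (an assumption) for what load_data() returns.

```python
import numpy as np


def resplit(train_pair, test_pair, n_train, seed=0):
    """Pool train and test data, shuffle once, and cut at n_train."""
    X = np.concatenate([train_pair[0], test_pair[0]])
    y = np.concatenate([train_pair[1], test_pair[1]])
    if not 0 <= n_train <= len(X):
        raise ValueError("n_train must be between 0 and the pooled size")

    # One shared permutation keeps every image paired with its label.
    order = np.random.default_rng(seed).permutation(len(X))
    X, y = X[order], y[order]
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])


# Small stand-in arrays (assumption: replace with the output of load_data()).
X_train, y_train = np.zeros((60, 28, 28), dtype=np.uint8), np.arange(60) % 10
X_test, y_test = np.ones((10, 28, 28), dtype=np.uint8), np.arange(10) % 10

(X_a, y_a), (X_b, y_b) = resplit((X_train, y_train), (X_test, y_test), n_train=50)
```

With the real arrays, resplit((X_train, y_train), (X_test, y_test), n_train=50000) would give 50,000 training and 20,000 evaluation images.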

Szymon Maszke
  • Here is some code to do this: (train_images, train_labels), (test_images, test_labels) = datasets.mnist.load_data() train_images = train_images.reshape((60000, 28, 28, 1)) test_images = test_images.reshape((10000, 28, 28, 1)) – ronm333 Apr 30 '19 at 01:53