
I have a directory with 12 CSV files. I read them with TensorFlow using the following code:

import tensorflow as tf

a = [0, 2, 3, 4, 5, 19, 23, 32, 39, 40, 42, 50, 51, 53, 56, 65, 66, 67, 68, 69]
data = tf.data.experimental.make_csv_dataset("./raw/*",
                                             batch_size=2000,
                                             select_columns=a,
                                             label_name="Cancelled",
                                             num_epochs=30,
                                             num_parallel_reads=2)

How can I split this dataset into training and testing datasets?

I am quite new to TensorFlow and have no idea how to work with prefetched datasets.

Shawn Brar

1 Answer


You can use:

train_size = int(0.7 * DATASET_SIZE)
test_size  = int(0.3 * DATASET_SIZE)

train_dataset = data.take(train_size)
test_dataset  = data.skip(train_size)

The training dataset gets the first (0.7 * DATASET_SIZE) elements and the rest goes to testing. Note that because make_csv_dataset returns batched data, each element here is a batch of 2000 rows, so take() and skip() count batches, not individual rows.

Take: Creates a Dataset with at most count elements from this dataset.

Skip: Creates a Dataset that skips count elements from this dataset.
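As a quick illustration of those two operations on a toy range dataset (standing in for the CSV pipeline):

```python
import tensorflow as tf

# Any tf.data.Dataset works the same way; a range dataset keeps it simple
ds = tf.data.Dataset.range(10)

first_three = list(ds.take(3).as_numpy_iterator())  # the first 3 elements
remaining   = list(ds.skip(3).as_numpy_iterator())  # everything after them

print(first_three)  # [0, 1, 2]
print(remaining)    # [3, 4, 5, 6, 7, 8, 9]
```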

Note, however, that take() and skip() require knowing the size of your dataset beforehand.
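If the size isn't known (tf.data.experimental.cardinality often reports UNKNOWN for CSV datasets), one option is to count the batches with a single pass over the data. This is a sketch on a toy batched dataset, assuming one full pass is affordable:

```python
import tensorflow as tf

# Toy batched dataset standing in for the CSV pipeline:
# 100 elements in batches of 10 -> 10 batches
data = tf.data.Dataset.range(100).batch(10)

# Count the batches by reducing over the dataset once
DATASET_SIZE = data.reduce(0, lambda count, _: count + 1).numpy()

train_size = int(0.7 * DATASET_SIZE)
train_dataset = data.take(train_size)
test_dataset = data.skip(train_size)

print(DATASET_SIZE)  # 10
```

With the sizes in hand, the take()/skip() split from the answer above applies unchanged.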

Will
  • Taken from https://stackoverflow.com/questions/48213766/split-a-dataset-created-by-tensorflow-dataset-api-in-to-train-and-test and https://stackoverflow.com/questions/51125266/how-do-i-split-tensorflow-datasets/58452268#58452268 – AloneTogether Nov 03 '22 at 09:44
  • But what if I don't know the size of my dataset. – Shawn Brar Nov 03 '22 at 09:47