source_dataset = tf.data.TextLineDataset('primary.csv')
target_dataset = tf.data.TextLineDataset('secondary.csv')
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
dataset = dataset.shard(10000, 0)
dataset = dataset.map(lambda source, target: (tf.string_to_number(tf.string_split([source], delimiter=',').values, tf.int32),
                                              tf.string_to_number(tf.string_split([target], delimiter=',').values, tf.int32)))
dataset = dataset.map(lambda source, target: (source, tf.concat(([start_token], target), axis=0), tf.concat((target, [end_token]), axis=0)))
dataset = dataset.map(lambda source, target_in, target_out: (source, tf.size(source), target_in, target_out, tf.size(target_in)))

dataset = dataset.shuffle(NUM_SAMPLES)  # This is the important line of code

I would like to shuffle my entire dataset fully, but shuffle() requires a buffer size (the number of samples to pull) to be passed in, and tf.size() does not work on a tf.data.Dataset.

How can I shuffle properly?

Evan Weissburg
  • It should be the size of your smaller csv file. I'm not aware of a function or property in Tensorflow that returns the length of the Dataset. – Lescurel Dec 10 '17 at 07:56
  • From the [documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#zip): *The number of elements in the resulting dataset is the same as the size of the smallest dataset* – Lescurel Dec 12 '17 at 10:11
  • zip() works the same way; iteration ends when StopIteration is raised (by the shortest object). – markemus May 29 '19 at 21:30

2 Answers


I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem. In my case, I was trying to only take a certain percentage of the raw data. Since I knew all the records have a fixed length, a workaround for me was:

# filepath, percentage, bytesPerRecord, filenames and recordBytes are defined elsewhere.
totalBytes = sum([os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath)])
# Number of fixed-length records corresponding to the desired percentage.
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, recordBytes).take(numRecordsToTake)

In your case, my suggestion would be to count the number of records in 'primary.csv' and 'secondary.csv' directly in Python. Alternatively, for your purpose, setting the buffer_size argument doesn't really require counting the records: according to the accepted answer about the meaning of buffer_size, a number greater than the number of elements in the dataset will ensure a uniform shuffle across the whole dataset. So just putting in a really big number (one that you think will surpass the dataset size) should work.
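
For instance, a minimal sketch of the counting approach (assuming the 'primary.csv' and 'secondary.csv' files from the question, one record per line and no header row) might look like this:

# Count the records directly in Python.
with open('primary.csv') as f:
    num_primary = sum(1 for _ in f)
with open('secondary.csv') as f:
    num_secondary = sum(1 for _ in f)

# zip() keeps only as many elements as the smaller file contributes.
num_samples = min(num_primary, num_secondary)

# A buffer at least as large as the dataset gives a uniform shuffle.
dataset = dataset.shuffle(buffer_size=num_samples)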

Ringo

As of TensorFlow 2, the length of the dataset can be easily retrieved by means of the cardinality() function.

dataset = tf.data.Dataset.range(42)
# both evaluate to 42
dataset_length_v1 = tf.data.experimental.cardinality(dataset).numpy()
dataset_length_v2 = dataset.cardinality().numpy()
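
For the use case in the question, a minimal sketch (assuming TF 2.x and a dataset whose cardinality is known and finite) is to feed that value straight into shuffle():

import tensorflow as tf

dataset = tf.data.Dataset.range(42)   # stand-in for the zipped CSV dataset
length = dataset.cardinality()        # scalar int64 tensor, 42 here

# Buffering as many elements as the dataset contains yields a full shuffle.
dataset = dataset.shuffle(buffer_size=length)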

NOTE: When using predicates such as filter, the returned length may be -2. One can consult an explanation here, otherwise just read the following paragraph:

If you use the filter predicate, the cardinality may return the value -2, hence unknown; if you do use filter predicates on your dataset, ensure that you have calculated the length of your dataset in another manner (for example, the length of the pandas DataFrame before applying .from_tensor_slices() on it).
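
A small sketch of both points (the pandas DataFrame here is purely hypothetical, just to illustrate recording the length before the dataset is built and filtered):

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({'x': range(100)})                 # hypothetical source data
dataset = tf.data.Dataset.from_tensor_slices(dict(df))

print(dataset.cardinality().numpy())                 # 100

filtered = dataset.filter(lambda row: row['x'] > 50)
print(filtered.cardinality().numpy())                # -2 (tf.data.UNKNOWN_CARDINALITY)

# Record the length from the DataFrame itself before filtering.
dataset_length = len(df)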

Timbus Calin