I have a dataset that comes in as a tf.data.Dataset from the tensorflow_datasets (tfds) module. A tf.data.Dataset is of course an iterable over examples, but I need to convert this into a full tensor containing all of the data loaded into memory. I am working with textual data, and in order to extract the vocabulary of the corpus for tokenization I actually need the entire corpus of text at once.
I can of course write a loop to do this, but I was wondering whether there is a more vectorized or faster way to accomplish the same task. Thanks.
Here are the beginnings of the code. Note that I am using TensorFlow 2.0 alpha to get ready for the changeover:
import tensorflow_datasets as tfds
# Download the data
imdb_builder = tfds.builder('imdb_reviews')
imdb_builder.download_and_prepare()
# Setup training test split
imdb_train = imdb_builder.as_dataset(split=tfds.Split.TRAIN)
imdb_test = imdb_builder.as_dataset(split=tfds.Split.TEST)
# Look at the specs on the dataset if you wish
# print(imdb_builder.info)
To look at a single example (observe that the data is un-tokenized):
a, = imdb_train.take(1)
print(a['text'])
tf.Tensor(b"As a lifelong fan of Dickens, I have ...", shape=(), dtype=string)
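For reference, the loop I had in mind looks something like the following. This is only a sketch: it uses a small stand-in dataset in place of imdb_train (so it runs without downloading anything), and it assumes eager execution, which I believe is the default in TF 2.0:

```python
import tensorflow as tf

# Toy stand-in for imdb_train: a dataset of dicts with a 'text' field,
# mirroring the structure tfds produces for imdb_reviews.
toy_ds = tf.data.Dataset.from_tensor_slices(
    {'text': [b'first review', b'second review', b'third review']}
)

# Naive loop: materialize every example's text into a Python list.
# Under eager execution the dataset is directly iterable.
all_texts = [example['text'].numpy() for example in toy_ds]
```

This works, but it pulls examples out one at a time through Python, which is exactly the overhead I was hoping to avoid.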
This is where I got stuck. When trying to create an iterator over this dataset, I got an error:
imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-35-1bf70c474a05> in <module>()
----> 1 imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()
AttributeError: 'RepeatDataset' object has no attribute 'make_one_shot_iterator'
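From skimming the 2.0 migration notes, I believe make_one_shot_iterator was removed because datasets are directly iterable under eager execution, so I suspect the replacement is Python's built-in iter(). A minimal sketch with a stand-in dataset (not the IMDB data) of what I think the new idiom is:

```python
import tensorflow as tf

# Stand-in for imdb_train.batch(10): in TF 2.0 under eager execution a
# dataset is directly iterable, so the built-in iter()/next() replace
# make_one_shot_iterator().
ds = tf.data.Dataset.range(25).batch(10)
it = iter(ds)
first_batch = next(it)  # a tensor holding the first 10 elements
```

Even if that fixes the error, it still goes batch by batch rather than giving me the whole corpus as one tensor, which is what my question is really about.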