
I have a dataset that comes in as a `tf.data.Dataset` from the new `tensorflow_datasets` module. Of course a `tf.data.Dataset` is an iterator over examples, but I actually need to convert this iterator into a full tensor containing all of the data loaded into memory. I am working with textual data, and in order to extract the vocabulary of the corpus for tokenization, I actually need the entire corpus of text at once.

I can of course write a loop to do this, but I was wondering whether there is a more vectorized or faster way to accomplish the same task. Thanks.
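For concreteness, here is the kind of naive loop I mean (a rough sketch; `dataset` is just a stand-in for any `tf.data.Dataset` that yields scalar string tensors):

import tensorflow as tf

# Materialize every example eagerly, then stack into a single string tensor
texts = []
for example in dataset:  # eager iteration over the dataset
    texts.append(example)
corpus = tf.stack(texts)  # shape: (num_examples,), dtype=tf.string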

I can at least provide the beginnings of the code. Note that I am using TensorFlow 2.0 alpha to try to get ready for the changeover:

import tensorflow_datasets as tfds

# Download the data
imdb_builder = tfds.builder('imdb_reviews')
imdb_builder.download_and_prepare()

# Setup training test split
imdb_train = imdb_builder.as_dataset(split=tfds.Split.TRAIN)
imdb_test = imdb_builder.as_dataset(split=tfds.Split.TEST)

# Look at the specs on the dataset if you wish
# print(imdb_builder.info)

To look at a single example, observe that the data is un-tokenized:

a, = imdb_train.take(1)
print(a['text'])

tf.Tensor(b"As a lifelong fan of Dickens, I have ...", shape=(), dtype=string)

This is where I got stuck. Note that when trying to create the iterator over this dataset I obtained an error:

iter = imdb_train.batch(10).repeat(1).make_one_shot_iterator()

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-35-1bf70c474a05> in <module>()
----> 1 imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()

AttributeError: 'RepeatDataset' object has no attribute 'make_one_shot_iterator'
  • You don't need to load your whole corpus at once (and you really shouldn't). You could use `tf.data.Dataset` and `.map` with a custom functor similar to [this answer](https://stackoverflow.com/questions/55421290/tensorflow-2-0-keras-how-to-write-image-summaries-for-tensorboard) (it's not like I'm plugging my answer there or anything, you know). The `Tensorflow Datasets` library with its [Tokenizer](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer) could be helpful here as well. A vocabulary gatherer should be quite easy to code with both of those concepts. – Szymon Maszke Apr 25 '19 at 22:02
  • @SzymonMaszke thanks for the comment. Yes, you picked up on one of the frustrations here. In order to extract the full vocabulary, I would need to iterate over the entire dataset, so that I know I have obtained all possible words. Here is an example using Tokenizer--see the accepted answer. https://stackoverflow.com/questions/51123481/how-to-build-a-language-model-using-lstm-that-assigns-probability-of-occurence-f/51126064#51126064 – krishnab Apr 25 '19 at 22:06
  • This one is a different thing; I meant a more modern and readable approach than the one you linked, and iterating via `map` is how `tf.data.Dataset` is supposed to work. Although I'm really unsure whether `tensorflow` and its `tf.data.Dataset` abstraction would be convenient to work with in the long run in the NLP case (e.g. lemmatizing). If I were you I would change the framework (if you can, of course). If you could provide me with a sample `tf.data.Dataset` I might work something out though. – Szymon Maszke Apr 25 '19 at 22:10
  • I added some code--at least as far as I got. The new Tensorflow Datasets don't seem to follow the `tf.data.Dataset` spec 100%, so for some reason I could not create the iterator. Not sure if this is a Tensorflow 2.0 issue or such? The idea is to iterate over this, identify the words, and assign them to ids. There is a `tf.lookup` class for assigning a string to an id (a small sketch follows these comments): https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/lookup/StaticVocabularyTable – krishnab Apr 25 '19 at 22:18
  • That's the most succinct version I can come up with in a reasonable amount of time. I was thinking about using `map` and pipelining this (and other) operations you might encounter along the way, but it seems `Tensorflow` doesn't really have the tools (to be honest, it's not focused on text processing). – Szymon Maszke Apr 25 '19 at 22:53
  • Oh thanks so much @SzymonMaszke. Yeah, I generally work with images myself, but was trying to explore text a bit as well. Thanks so much for your effort. I can at least get started with what you were able to produce and go from there. – krishnab Apr 25 '19 at 22:55
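As a tiny sketch of the `tf.lookup` idea mentioned in the comments above (the keys and ids here are made-up toy values, not taken from the actual dataset):

import tensorflow as tf

# Toy vocabulary; in practice the keys/ids would come from the gathered corpus
init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["This", "was", "soul"]),
    values=tf.constant([1, 2, 3], dtype=tf.int64))

# One out-of-vocabulary bucket catches any word not in the table
table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

print(table.lookup(tf.constant(["This", "unseen"])))  # "unseen" maps to the OOV bucket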

1 Answer


1. Data Loading

Using tfds.load is simpler and more compact:

import tensorflow_datasets as tfds

train = tfds.load("imdb_reviews", as_supervised=True, split=tfds.Split.TRAIN)

2. Vocabulary saver

Pretty simple; you may want to start indexing from zero instead of one.

class Tokenizer:
    def __init__(self):
        self.vocab = {}  # maps token -> integer id
        self._counter: int = 1  # change to 0 for zero-based indexing
        self.tokenizer = tfds.features.text.Tokenizer()

    def __call__(self, text):
        # Haven't found anything working directly on tf.Tensor, oh sweet irony,
        # so pull the raw bytes out with .numpy() before tokenizing
        tokens = self.tokenizer.tokenize(text.numpy())
        for token in tokens:
            if token not in self.vocab:
                self.vocab[token] = self._counter
                self._counter += 1

TBH it's a shame there is no tokenizer-like utility for plain tensors and I need to convert them like that, but oh well, it's still in the alpha stage.

3. Tokenize your data

Since TF 2.0 and its eager mode, you can skip `one_shot_iterator` and other strange ideas and iterate comfortably using a plain loop:

tokenizer = Tokenizer()

for text, _ in train:
    tokenizer(text)

Important: You don't have to load everything into memory, as it's an iterator. You may, however, run into memory problems with `vocab` for really large corpora.
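If the pure Python loop turns out to be slow, one variation you could try (an untested sketch; the batch size of 32 is arbitrary) is batching the pipeline so each pass through `tf.data` pulls several reviews at once:

# Sketch: fetch batches of reviews, then tokenize each element of the batch
for texts, _ in train.batch(32):
    for text in texts:  # texts is a 1-D string tensor; each text is a scalar
        tokenizer(text)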

4. Results

Printing items and their indices:

print(list(tokenizer.vocab.keys())[:10])
print(list(tokenizer.vocab.values())[:10])

Gives us:

['This', 'was', 'soul', 'provoking', 'I', 'am', 'an', 'Iranian', 'and', 'living']
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
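And to map an actual piece of text to ids with the vocabulary built above (the 0 fallback for unknown tokens is just an arbitrary choice for this sketch):

sentence = "This was soul provoking"
ids = [tokenizer.vocab.get(token, 0)
       for token in tokenizer.tokenizer.tokenize(sentence)]
print(ids)  # [1, 2, 3, 4] given the vocabulary above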