
I load my dataset like this:

self.train_ds = tf.data.experimental.make_csv_dataset(
    self.config["input_paths"]["data"]["train"],
    batch_size=self.params["batch_size"],
    shuffle=False,
    label_name="tags",
    num_epochs=1,
)

My TextVectorization layer looks like this:

vectorizer = tf.keras.layers.TextVectorization(
    standardize=code_standaridization,
    split="whitespace",
    output_mode="int",
    output_sequence_length=params["input_dim"],
    max_tokens=100_000,
)

And I thought this would be enough:

vectorizer.adapt(data_provider.train_ds)

But it's not; I get this error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None, None), dtype=string) of type 'Tensor'.

Can I somehow adapt my vectorizer on a TensorFlow dataset?

1 Answer

Most probably the issue is that your train_ds is batched (you created it with batch_size) and you don't call .unbatch() before you try to adapt.

You have to do:

vectorizer.adapt(train_ds.unbatch().map(lambda x, y: x).batch(BATCH_SIZE))

The .unbatch() call resolves the error you are currently seeing, and the .map() is needed because the TextVectorization layer adapts on (batches of) plain strings, so you have to drop the labels and keep only the text from your dataset.
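
If it helps, here is a minimal end-to-end sketch of the adapt step, assuming the CSV has a single text column named "code" and using placeholder values for the path, batch size, and sequence length (swap in your own config values and your code_standaridization function):

import tensorflow as tf

BATCH_SIZE = 32  # placeholder for params["batch_size"]

# Same loading call as in the question, with a placeholder path.
train_ds = tf.data.experimental.make_csv_dataset(
    "train.csv",
    batch_size=BATCH_SIZE,
    shuffle=False,
    label_name="tags",
    num_epochs=1,
)

# make_csv_dataset yields (features, label) pairs where features is an
# OrderedDict of column tensors, so select the text column by name
# (assumed here to be "code") instead of passing the whole dict to adapt().
text_only_ds = (
    train_ds
    .unbatch()                                       # back to individual examples
    .map(lambda features, label: features["code"])   # keep only the raw strings
    .batch(BATCH_SIZE)                               # re-batch so adapt() runs faster
)

vectorizer = tf.keras.layers.TextVectorization(
    split="whitespace",
    output_mode="int",
    output_sequence_length=128,  # placeholder for params["input_dim"]
    max_tokens=100_000,
)
vectorizer.adapt(text_only_ds)

Once adapted, you can reuse the same .map() pattern in the training pipeline to turn the raw strings into token ids, or use the vectorizer as the first layer of the model.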