I am trying to adapt a TextVectorization layer to UTF-8 encoded text:
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize="lower",
)
encoder.adapt(train.map(
    lambda doc, label: doc
))
The following error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 49: ordinal not in range(128)
The input is encoded as UTF-8, which, as far as I understand, TensorFlow should handle.
Environment:
TensorFlow version: 2.10.0
Python version: 3.10.8
Platform: Linux
Full traceback:
InvalidArgumentError Traceback (most recent call last)
Cell In [10], line 8
1 VOCAB_SIZE = 5000
3 encoder = tf.keras.layers.TextVectorization(
4 max_tokens=VOCAB_SIZE,
5 standardize="lower",
6 )
----> 8 encoder.adapt(train.map(
9 lambda doc, label : doc
10 ))
File ~/.local/lib/python3.10/site-packages/keras/layers/preprocessing/text_vectorization.py:467, in TextVectorization.adapt(self, data, batch_size, steps)
417 def adapt(self, data, batch_size=None, steps=None):
418 """Computes a vocabulary of string terms from tokens in a dataset.
419
420 Calling `adapt()` on a `TextVectorization` layer is an alternative to
(...)
465 argument is not supported with array inputs.
466 """
--> 467 super().adapt(data, batch_size=batch_size, steps=steps)
File ~/.local/lib/python3.10/site-packages/keras/engine/base_preprocessing_layer.py:258, in PreprocessingLayer.adapt(self, data, batch_size, steps)
256 with data_handler.catch_stop_iteration():
257 for _ in data_handler.steps():
--> 258 self._adapt_function(iterator)
259 if data_handler.should_sync:
260 context.async_wait()
File ~/.local/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
52 try:
53 ctx.ensure_initialized()
---> 54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
57 if name is not None:
InvalidArgumentError: Graph execution error:
2 root error(s) found.
(0) INVALID_ARGUMENT: UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
return [self._convert(x) for x in ret]
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
return [self._convert(x) for x in ret]
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
[[{{node PyFunc}}]]
[[IteratorGetNext]]
[[UniqueWithCounts/_6]]
(1) INVALID_ARGUMENT: UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
return [self._convert(x) for x in ret]
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
return [self._convert(x) for x in ret]
File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
return result.astype(np.bytes_)
UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
[[{{node PyFunc}}]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_adapt_step_195]
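For reference, the innermost frame of the traceback is `result.astype(np.bytes_)` in script_ops.py, and that call seems to fail the same way with NumPy alone. The following is a minimal sketch, assuming the `PyFunc` node in the pipeline hands back Python `str` values rather than bytes. The sample string and both helper names are hypothetical and not part of my actual code.

```python
import numpy as np

# Hypothetical sample text; any non-ASCII character should behave the same way.
sample = np.array(["Acme\u2122 widget"])


def to_bytes_default(arr):
    """Same conversion as the failing frame: unicode array -> bytes array."""
    return arr.astype(np.bytes_)


def to_bytes_utf8(arr):
    """Encode every element as UTF-8 explicitly, producing a bytes array."""
    return np.char.encode(arr, "utf-8")


# The default conversion goes through the ASCII codec and rejects the trademark sign.
try:
    to_bytes_default(sample)
    default_failed = False
except UnicodeEncodeError as err:
    default_failed = True
    print("default conversion failed:", err)

# Explicit UTF-8 encoding succeeds and round-trips back to the original text.
encoded = to_bytes_utf8(sample)
print(encoded[0].decode("utf-8") == sample[0])
```

If that assumption holds, the strings would need to be encoded to UTF-8 bytes (for example with `str.encode("utf-8")`) inside the Python function before they are returned to TensorFlow, instead of relying on the implicit conversion.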