I am trying to adapt a TextVectorization layer to UTF-8 encoded text:

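import tensorflow as tf

VOCAB_SIZE = 5000  # value shown in the traceback below

encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize="lower",
)

# `train` yields (doc, label) pairs; keep only the text for adapt()
encoder.adapt(train.map(lambda doc, label: doc))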

The following error occurs:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2122' in position 49: ordinal not in range(128)

The input is encoded as UTF-8, which, as far as I understand, TensorFlow should be able to handle.

Environment:

TensorFlow version: 2.10.0

Python version: 3.10.8

Platform: Linux

Full traceback:

InvalidArgumentError                      Traceback (most recent call last)
Cell In [10], line 8
      1 VOCAB_SIZE = 5000
      3 encoder = tf.keras.layers.TextVectorization(
      4     max_tokens=VOCAB_SIZE,
      5     standardize="lower",
      6 )
----> 8 encoder.adapt(train.map(
      9     lambda doc, label : doc
     10 ))

File ~/.local/lib/python3.10/site-packages/keras/layers/preprocessing/text_vectorization.py:467, in TextVectorization.adapt(self, data, batch_size, steps)
    417 def adapt(self, data, batch_size=None, steps=None):
    418     """Computes a vocabulary of string terms from tokens in a dataset.
    419 
    420     Calling `adapt()` on a `TextVectorization` layer is an alternative to
   (...)
    465           argument is not supported with array inputs.
    466     """
--> 467     super().adapt(data, batch_size=batch_size, steps=steps)

File ~/.local/lib/python3.10/site-packages/keras/engine/base_preprocessing_layer.py:258, in PreprocessingLayer.adapt(self, data, batch_size, steps)
    256 with data_handler.catch_stop_iteration():
    257     for _ in data_handler.steps():
--> 258         self._adapt_function(iterator)
    259         if data_handler.should_sync:
    260             context.async_wait()

File ~/.local/lib/python3.10/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File ~/.local/lib/python3.10/site-packages/tensorflow/python/eager/execute.py:54, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     52 try:
     53   ctx.ensure_initialized()
---> 54   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
     55                                       inputs, attrs, num_outputs)
     56 except core._NotOkStatusException as e:
     57   if name is not None:

InvalidArgumentError: Graph execution error:

2 root error(s) found.
  (0) INVALID_ARGUMENT:  UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
    return [self._convert(x) for x in ret]

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)

UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)


     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
     [[UniqueWithCounts/_6]]
  (1) INVALID_ARGUMENT:  UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)
Traceback (most recent call last):

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in __call__
    return [self._convert(x) for x in ret]

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 279, in <listcomp>
    return [self._convert(x) for x in ret]

  File "/home/moss/.local/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 237, in _convert
    return result.astype(np.bytes_)

UnicodeEncodeError: 'ascii' codec can't encode character '\xae' in position 401: ordinal not in range(128)


     [[{{node PyFunc}}]]
     [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_adapt_step_195]
  • It looks like some of your text data contains Unicode strings that need to be converted before TensorFlow can use them. See this [link](https://www.tensorflow.org/text/guide/unicode#converting_between_representations) for how to convert between string representations, and this similar [issue](https://stackoverflow.com/questions/48347331/python-unicodeencodeerror-ascii-codec-cant-encode-character-in-position-0-o) for reference. –  Dec 07 '22 at 13:18
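
A possible direction, not verified against this dataset: the traceback fails in `result.astype(np.bytes_)` inside a `PyFunc` node, which suggests that Python-side code feeding `train` hands TensorFlow `str` values that are then converted with the ASCII codec. If that is the case, encoding each document to UTF-8 bytes before it leaves Python may avoid the error. The sketch below uses only the standard library; `to_utf8_bytes` and `ascii_encodable` are hypothetical helper names, and the sample strings are made up to include the two characters from the error messages (`\u2122` and `\xae`).

```python
def to_utf8_bytes(doc):
    """Return `doc` as UTF-8 bytes; bytes input passes through unchanged."""
    if isinstance(doc, bytes):
        return doc
    return doc.encode("utf-8")


def ascii_encodable(doc):
    """Report whether a str would survive an ASCII-only conversion."""
    try:
        doc.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Made-up samples containing the characters named in the error messages.
samples = ["plain text", "Brand\u2122 name", "Registered\u00ae mark"]

for s in samples:
    # Strings that are not ASCII-encodable are the ones that would trip
    # an ASCII conversion; their UTF-8 bytes are what a string tensor stores.
    print(repr(s), ascii_encodable(s), to_utf8_bytes(s))
```

Assuming `train` is built from a Python generator or a `tf.py_function`, applying something like `to_utf8_bytes(doc)` to each document inside that function, before it is returned to TensorFlow, would be the place to try this.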
