
Issue with BertLMDataBunch.from_raw_corpus(): 'charmap' codec can't encode character '\u0627' in position 0: character maps to <undefined>

When creating a BertLMDataBunch object, I get this error: 'charmap' codec can't encode character '\u0627' in position 0: character maps to <undefined>. When I tried to encode my texts using UTF-8, I got this error instead: 'charmap' codec can't encode characters in position 20-25: character maps to <undefined>. I also tried avoiding punctuation and special characters such as 'éèêçàôûù', but I got the same error.

df_train is my labeled dataset, and Description is the column containing the French texts.

from pathlib import Path

DATA_PATH = Path('./data/')

# Texts from the Description column, round-tripped through UTF-8
all_texts = df_train['Description'].to_list()
all_texts = [x.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore') for x in all_texts]
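A minimal sketch of why the round-trip above probably cannot help. It assumes the error is raised when the text is written with a Windows default encoding such as cp1252; that is an assumption, since the traceback is not shown. The `sample` string is made up for illustration.

```python
# Hypothetical sample: a French accent plus the Arabic letter alef (U+0627),
# the character named in the error message.
sample = "r\u00e9clamation \u0627"

# Encoding to UTF-8 and decoding again is lossless, so errors='ignore'
# has nothing to drop and the text comes back unchanged.
round_trip = sample.encode("utf-8", errors="ignore").decode("utf-8", errors="ignore")
unchanged = round_trip == sample

# cp1252 (assumed here to be the platform default) is a charmap codec
# with no mapping for U+0627, so encoding to it fails.
try:
    sample.encode("cp1252")
    charmap_error = None
except UnicodeEncodeError as exc:
    charmap_error = str(exc)

print(unchanged)
print(charmap_error)
```

If this assumption holds, the strings themselves are fine and the failure happens later, wherever the text is encoded with the platform default instead of UTF-8.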

The texts also contain numbers.

The BertLMDataBunch object: [screenshot of the code; not available as text]

The object I created generates a text file, lm_trained, that contains text like this:

Bonjour Le 21 Avril 2021 j ai envoy� une r�clamation
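One way output like this can arise, sketched under the assumption that the file is written in cp1252 and later read as UTF-8 (the actual encodings involved are not known from the post):

```python
# The accented text as it presumably looked before being written out.
original = "j ai envoy\u00e9 une r\u00e9clamation"

# Write side: cp1252 stores each 'é' as the single byte 0xE9.
written = original.encode("cp1252")

# Read side: a lone 0xE9 is not valid UTF-8, so with errors='replace'
# each one becomes the replacement character U+FFFD.
misread = written.decode("utf-8", errors="replace")

print(misread)
```

Reading and writing the file with the same explicit encoding, for example `open(path, encoding="utf-8")`, avoids this kind of mismatch.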

Can anyone help me fix this? Thank you!

YAMADA
  • Please revise your question to include a minimal reproducible example that can demonstrate the bug. Present it as text, not screenshots. [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) – Ruan Jun 04 '23 at 16:10
