I have text that I extracted from an image using Tesseract. When I try to print it in the terminal, I get the following error whenever the text contains special characters (é, è, à, ç ...): 'ascii' codec can't encode character '\xc7' in position 10: ordinal not in range(128)
When I write the extracted text to a file, I get the correct text including the special characters!
Here is the code I used:
# -*- coding: utf-8 -*-
import cv2
import pytesseract

# The with block closes the file on exit, so no explicit f.close() is needed.
with open('path_to_text_file', 'w', encoding='utf-8') as f:
    try:
        im = cv2.imread(path_to_image)
        text = pytesseract.image_to_string(im, lang='fra')
        f.write(text + '\n')
        print(text)
    except Exception as e:
        print(e)
I also tried print(str(text)) instead of print(text), but nothing changed!
In case it is helpful, when I print the type of the variable text (print(type(text))), I get <class 'str'>.
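The error can probably be reproduced without Tesseract or an image. The following is only a minimal sketch: `try_print` is a hypothetical helper (not part of the original code) that prints a string through a text stream with a chosen encoding, on the assumption that the terminal's stdout behaves like a stream opened with the 'ascii' encoding.

```python
import io


def try_print(text, encoding):
    """Print text through an in-memory stream that uses the given encoding.

    Returns None on success, or the error message if encoding fails.
    This is an illustrative helper, not part of the original script.
    """
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding)
    try:
        print(text, file=stream)
        stream.flush()
        return None
    except UnicodeEncodeError as e:
        return str(e)


# An ASCII stream rejects the accented character; a UTF-8 stream accepts it.
print(try_print("R\u00c9PUBLIQUE", "ascii"))
print(try_print("R\u00c9PUBLIQUE", "utf-8"))
```

If the first call reports the same 'ascii' codec message as above while the second returns None, that would suggest the problem lies in the encoding of the output stream rather than in the text returned by pytesseract.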
Any ideas how to fix this error?
EDIT:
Example of the files I am dealing with (Don't worry about confidentiality, this example is from the internet)
I use Ubuntu 18.04 and Python 3.6.
The project that I run is in a Docker container.
EDIT2:
Output displayed in the terminal:
'ascii' codec can't encode character '\xc9' in position 1: ordinal not in range(128)
'ascii' codec can't encode character '\xc9' in position 12: ordinal not in range(128)
'ascii' codec can't encode character '\xe9' in position 10: ordinal not in range(128)
30 | Noms BERTHIER
'ascii' codec can't encode character '\xe9' in position 2: ordinal not in range(128)
'ascii' codec can't encode character '\u2026' in position 0: ordinal not in range(128)
Sexe
Sexe: L N
3: PARIS 1ER (75)
ETES
Taie : 170
Cruise Her
| Signature
Le pol
du titulaire :
IDFRABERTHIFR<<EK<KEKKKELELEREREELEREE
88069231028S8CORINNE<<<<<<<6512068F6
Output written to the text file:
RÉPUBLIQUE FRANÇAI
RE
D'IDENTITÉ Ne : 880692310285
Nationalité Française
30 | Noms BERTHIER
Prénoms): CORINNE
… Néfel le: 06.12.1985
Sexe
Sexe: L N
3: PARIS 1ER (75)
ETES
Taie : 170
Cruise Her
| Signature
Le pol
du titulaire :
IDFRABERTHIFR<
88069231028S8CORINNE<<<<<<<6512068F6
EDIT3:
If I remove encoding='utf-8' from with open(filename, 'w') .. I only get the normal characters; every line that contains special characters is no longer written to the file.
The Python I/O encoding is utf-8.
The output of locale -a is: C C.UTF-8 POSIX
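To see which encodings the interpreter actually uses inside the container, a small check like the one below may help. It is a sketch only: `describe_output_encoding` is a hypothetical helper name, and it merely reports values from the standard `sys` and `locale` modules.

```python
import locale
import sys


def describe_output_encoding():
    """Collect the encodings Python reports for stdout and the locale."""
    return {
        # Encoding used when print() writes to the terminal.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the current locale settings.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


for name, value in describe_output_encoding().items():
    print(name, "=", value)
```

If stdout_encoding shows ascii or ANSI_X3.4-1968, that would be consistent with the error above and would point at the container's locale (for example, no LANG or LC_ALL set to C.UTF-8) rather than at the script itself.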