I have text that I extracted from an image using Tesseract. When I try to print it in the terminal, I get the error 'ascii' codec can't encode character '\xc7' in position 10: ordinal not in range(128) whenever the text contains special characters (é, è, à, ç ...). When I write the extracted text to a file, however, I get the correct text, including the special characters!
Here is the code I used:

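# -*- coding: utf-8 -*-
import cv2
import pytesseract

path_to_image = 'path_to_image'  # placeholder: path of the image to OCR

# The with block closes the file on exit, so no explicit f.close() is needed
with open('path_to_text_file', 'w', encoding='utf-8') as f:
    try:
        im = cv2.imread(path_to_image)
        # Run OCR with the French language model
        text = pytesseract.image_to_string(im, lang='fra')
        f.write(text + '\n')  # writing to the file works
        print(text)           # printing raises the 'ascii' codec error
    except Exception as e:
        print(e)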

I also tried print(str(text)) instead of print(text) but nothing changed!
In case it is helpful, when I print the type of the variable text (print(type(text))), I get <class 'str'>. Any ideas how to fix this error?
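
The symptom can be reproduced without Tesseract or an image. The following is a minimal sketch, assuming the terminal stream is using the ASCII codec (which the error message suggests); the sample string is taken from the file output shown further down, and the variable names are only illustrative. It shows that the same str object encodes cleanly to UTF-8 but not to ASCII, which is roughly the difference between f.write() on a UTF-8 file and print() on an ASCII stdout.

```python
# Sample OCR line; "\u00c9" is É and "\u00c7" is Ç (escaped so this file stays ASCII).
text = "R\u00c9PUBLIQUE FRAN\u00c7AISE"

# Encoding to UTF-8 succeeds: this is what a file opened with encoding='utf-8' does.
utf8_bytes = text.encode("utf-8")

# Encoding to ASCII fails: this is what an ASCII-configured stream has to do.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(len(utf8_bytes), "bytes as UTF-8")
print(ascii_error)
```

The type of text being str is therefore not the problem; the failure happens when the stream converts that str to bytes.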

EDIT:

Example of the files I am dealing with (Don't worry about confidentiality, this example is from the internet)
[image]

I use Ubuntu 18.04 and Python 3.6.
The project I run is on Docker.

EDIT2:
Output displayed in the terminal:

'ascii' codec can't encode character '\xc9' in position 1: ordinal not in range(128)
'ascii' codec can't encode character '\xc9' in position 12: ordinal not in range(128)
'ascii' codec can't encode character '\xe9' in position 10: ordinal not in range(128)
30 | Noms BERTHIER
'ascii' codec can't encode character '\xe9' in position 2: ordinal not in range(128)
'ascii' codec can't encode character '\u2026' in position 0: ordinal not in range(128)
Sexe
Sexe: L N
3: PARIS 1ER (75)
ETES
Taie : 170
Cruise Her
| Signature
Le pol
du titulaire :
IDFRABERTHIFR<<EK<KEKKKELELEREREELEREE
88069231028S8CORINNE<<<<<<<6512068F6  

Output written to the text file:

RÉPUBLIQUE FRANÇAI
RE
D'IDENTITÉ Ne : 880692310285
Nationalité Française
30 | Noms BERTHIER
Prénoms): CORINNE
… Néfel le: 06.12.1985
Sexe
Sexe: L N
3: PARIS 1ER (75)
ETES
Taie : 170
Cruise Her
| Signature
Le pol
du titulaire :
IDFRABERTHIFR< 88069231028S8CORINNE<<<<<<<6512068F6

EDIT3:
If I remove encoding='utf-8' from with open(filename, 'w'), I only get the normal characters; every line that contains special characters is no longer written to the file.
Python's I/O encoding is utf-8.
The output of locale -a is: C C.UTF-8 POSIX
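
The behaviour in EDIT3 is consistent with how open() picks its default: when no encoding argument is given, Python uses locale.getpreferredencoding(False), not sys.getdefaultencoding(). A small sketch to check what that default is inside the container follows; the helper names default_text_encoding and can_encode are hypothetical and introduced only for illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() falls back to when no encoding= is given."""
    return locale.getpreferredencoding(False)


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


encoding = default_text_encoding()
print("open() default encoding:", encoding)
# "\u00e9" is é; under a C/POSIX locale this check is expected to fail.
print("can encode accented text:", can_encode("\u00e9", encoding))
```

If the first line reports an ASCII variant rather than UTF-8, the locale is the likely cause.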

singrium
  • Are you on Windows? What's Python's I/O encoding and what codepage is your system set up to use? See also [the Stack Overflow `character-encoding` tag info page](/tags/character-encoding/info) for troubleshooting tips and suggestions for how to ask a more well-defined question – tripleee Feb 19 '19 at 13:36
  • The edit doesn't help at all. The *text* you are attempting to output is interesting (the actual Unicode or bytes that Python is trying to output). – tripleee Feb 19 '19 at 13:43
  • Oh! Sorry, I didn't understand you, I'll edit the question again adding the output written to the file and the one displayed in the terminal – singrium Feb 19 '19 at 13:46
  • @tripleee, if there are any further information needed, please comment. Thank you for your help. – singrium Feb 19 '19 at 13:50
  • So what is your locale set to and what is Python I/O encoding? – tripleee Feb 19 '19 at 13:53
  • Maybe see also https://stackoverflow.com/questions/2276200/changing-default-encoding-of-python though a lot of it is specific to Python 2. – tripleee Feb 19 '19 at 13:55
  • Possible duplicate of [Changing default encoding of Python?](https://stackoverflow.com/questions/2276200/changing-default-encoding-of-python) – tripleee Feb 19 '19 at 13:58
  • @tripleee, print(sys.getdefaultencoding()) returns utf-8; as for "my local set to", I didn't get what you mean – singrium Feb 19 '19 at 13:58
  • The output from `locale`, perhaps trimmed down to elide redundant information (we don't care about sort order or monetary settings, so really probably just `LC_CTYPE` and `LANG` and maybe `LC_ALL` if it differs from `LC_CTYPE`) – tripleee Feb 19 '19 at 13:59
  • @tripleee, the output of `locale -a` is : `C C.UTF-8 POSIX` – singrium Feb 19 '19 at 14:09
  • Not `locale -a`, just `locale`. If your locale is `C` or `POSIX`, try setting it to `C.UTF-8`. – tripleee Feb 19 '19 at 14:16

1 Answer

As @tripleee said, the problem was the locale: it was set to POSIX. The fix, as he suggested, was to switch to a UTF-8 locale, for example by generating one with locale-gen fr_FR.UTF-8.
Since the project I am running is on Docker, I had to write these changes to the Dockerfile-dev.
Fortunately, I found a similar question about the same issue on Docker. Here is what I added to my Dockerfile-dev to set the locale to UTF-8:

RUN apt-get -qq update && \
    apt-get -q -y upgrade && \
    apt-get install -y sudo curl wget locales && \
    rm -rf /var/lib/apt/lists/*

# Ensure that we always use UTF-8 with the French locale
RUN locale-gen fr_FR.UTF-8


RUN chmod 0755 /etc/default/locale

ENV LC_ALL=fr_FR.UTF-8
ENV LANG=fr_FR.UTF-8
ENV LANGUAGE=fr_FR.UTF-8

After saving the Dockerfile-dev, I ran docker-compose build and docker-compose up.
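
To confirm from inside the rebuilt container that the locale change took effect, sys.stdout.encoding can be inspected. The sketch below also shows a per-process workaround that does not depend on the locale: wrapping sys.stdout.buffer in a UTF-8 text layer. This is an alternative I am adding for illustration, not part of the Dockerfile fix above, and the helper names make_utf8_writer and stdout_encoding are hypothetical.

```python
import io
import sys


def make_utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


def stdout_encoding():
    """Report the encoding print() will use, or None if the stream exposes none."""
    return getattr(sys.stdout, "encoding", None)


if __name__ == "__main__":
    print("stdout encoding:", stdout_encoding())
    sys.stdout.flush()
    # Workaround sketch: bypass the locale-derived encoding for this process only.
    if hasattr(sys.stdout, "buffer"):
        out = make_utf8_writer(sys.stdout.buffer)
        print("R\u00c9PUBLIQUE FRAN\u00c7AISE", file=out)
        out.flush()
        out.detach()  # leave the underlying stdout buffer open
```

Setting the PYTHONIOENCODING=utf-8 environment variable before starting Python is another way to get a similar effect without code changes.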

singrium