UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

Question

I am trying to run OCR on an image file in Python using Tesseract-OCR. My environment is Python 3.5 (Anaconda) on a Windows machine.

Here is the code:

from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))

The error I am getting is:

File "Anaconda3\lib\sitepackages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>

I have tried the solution mentioned here, but the hack is not working.

I have tried my code on Mac OS, and it works there.

I have looked into the pytesseract issues: there is an open issue for this.

Thanks

score 3 · Accepted Answer · answered Jun 25 '16 at 01:26

Hmm, something very weird is going on there. In the "latin1" text encoding, the byte "\x81" maps to an unprintable control character. In the "cp1252" encoding the library is using, however, it is explicitly mapped to an "undefined character", so decoding it raises an error.

What happens is that "latin1" is something of a "no-op" codec: it maps each byte value to the Unicode code point with the same number, so it is sometimes used in Python simply to turn a byte sequence into a Unicode string (the default string type in Python 3.x). The "cp1252" codec is almost the same thing, and in some contexts it is used interchangeably with latin1, but this "\x81" byte is one of the differences between the two. In your case, a crucial one.
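The difference between the two codecs can be checked directly in the interpreter. This is a small sketch using only the standard library; the variable names are illustrative and not part of pytesseract.

```python
raw = b"\x81"

# latin1 maps every byte to the code point with the same value, so this never fails.
as_latin1 = raw.decode("latin1")

# cp1252 leaves a handful of bytes undefined; strict decoding raises on them.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# With errors="replace" the undefined byte becomes U+FFFD instead of raising.
replaced = raw.decode("cp1252", errors="replace")


def decodes_in_cp1252(byte_value):
    """Return True if the single byte decodes under strict cp1252."""
    try:
        bytes([byte_value]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# All byte values that cp1252 refuses to decode.
undefined_bytes = [b for b in range(256) if not decodes_in_cp1252(b)]
print([hex(b) for b in undefined_bytes])
```

The byte from the traceback, 0x81, shows up in that list, which is why the same data decodes under latin1 but not under cp1252.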

The correct thing to do is to try supplying the image_to_string function with the optional lang parameter, so that Tesseract has a better chance of recognizing the character it is currently exposing as "0x81". However, this might not work, as it might simply be an OCR error producing a very weird character not related to the language at all.
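As a rough sketch, passing the language hint could look like the following. The helper name ocr_with_lang and the default "eng" are assumptions made for illustration; the language code has to match a language pack that is actually installed for Tesseract.

```python
def ocr_with_lang(image_path, lang="eng"):
    """Run Tesseract on image_path with an explicit language hint.

    Hypothetical helper: "eng" is only a placeholder default.
    """
    # Imported lazily so this module can be loaded without Pillow/pytesseract.
    from PIL import Image
    from pytesseract import image_to_string

    with Image.open(image_path) as img:
        return image_to_string(img, lang=lang)


# Example call (needs Tesseract and the chosen language data installed):
# text = ocr_with_lang("sample.png", lang="eng")
```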

So, the workaround for you is to monkeypatch the "cp1252" codec so that instead of raising an error, it fills in a Unicode "unrecognized" (replacement) character. One way to do that is to insert these lines before calling tesseract:

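from encodings import cp1252
# Keep the original decoder so the patched version can delegate to it.
original_decode = cp1252.Codec.decode
# Default to errors="replace" so undecodable bytes do not raise.
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)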

But please, if you can, open a bug report against the pytesseract project. My guess is they should be using "latin1" and not "cp1252" encoding at this point.
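To see why the choice of encoding matters when the OCR result is read back from a file, here is a self-contained sketch that reproduces the same kind of failure without Tesseract. It assumes the text on disk is UTF-8 (an assumption, not something stated above), and read_text is a hypothetical helper, not a pytesseract function. In UTF-8 the letter "Á" is stored as the bytes C3 81, so a strict cp1252 reader trips over the 0x81.

```python
import os
import tempfile


def read_text(path, encoding, errors="strict"):
    """Read a text file with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding, errors=errors) as f:
        return f.read().strip()


# Write a small UTF-8 file that contains the byte 0x81 ("Á" is C3 81 in UTF-8).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("Á test".encode("utf-8"))

# Strict cp1252 (the usual default on Western-locale Windows) fails on 0x81.
try:
    read_text(path, "cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Reading with the encoding the file was written in recovers the text.
as_utf8 = read_text(path, "utf-8")

# errors="replace" avoids the exception but still yields mangled text.
as_replaced = read_text(path, "cp1252", errors="replace")

os.remove(path)
print(strict_failed, repr(as_utf8), repr(as_replaced))
```

A platform whose default encoding is UTF-8 would read this file without complaint, which would be consistent with the same code working on Mac OS. Note also that the traceback in the question goes through a decode call that uses self.errors, which suggests the file was opened in text mode with the default encoding, so naming the encoding explicitly where the file is opened may be a more direct fix than patching the codec.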
