I am trying to extract text from an image file using Tesseract OCR in Python, but I am getting an error that I can't figure out how to deal with. My environment seems fine, as I have successfully tested the OCR on some sample images in Python.

Here is the code:

from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))

print(strs)

The following is the error I get from the Eclipse console:

strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
  File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am using Python 3.5 x64 on Windows 10.

Nwawel A Iroume
  • This reminds me of something I've encountered in the past; I don't know if it's exactly the same issue though. The fact that you're on Windows tipped me off - Python in CMD on windows seems to have a strange default code page. Have you tried hacking around at [`sys.setdefaultencoding`](http://stackoverflow.com/questions/2276200/changing-default-encoding-of-python) to see if that helps you diagnose the problem? (I'd probably avoid keeping that hack around in production code if you can help it though.) – Benjamin Hodgson Dec 15 '15 at 15:43
  • Possible duplicate of [Pytesseract: UnicodeDecodeError: 'charmap' codec can't decode byte](https://stackoverflow.com/questions/32927631/pytesseract-unicodedecodeerror-charmap-codec-cant-decode-byte) – Sreeragh A R Jun 06 '18 at 07:41

2 Answers

The problem is that Python is decoding Tesseract's output with the Windows default encoding (CP1252) instead of the encoding the output is actually written in (UTF-8). As the traceback shows, pytesseract reads the result file in text mode without specifying an encoding, and the file contains a byte (0x9d) that has no mapping in CP1252, so the decode fails. On a platform where the default encoding is UTF-8 you won't encounter this error.
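To make the failure concrete, here is a minimal sketch that reproduces the same kind of error without Tesseract. The sample string is my own assumption, chosen because a curly closing quote encodes to UTF-8 bytes ending in 0x9d, the byte named in the traceback.

```python
# Text with curly quotes, encoded as UTF-8 (the encoding Tesseract writes).
raw = "He said \u201chello\u201d".encode("utf-8")

# Decoding with CP1252 fails: the closing quote's UTF-8 bytes end in 0x9d,
# which Python's cp1252 codec leaves undefined.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Decoding with the right codec works.
text = raw.decode("utf-8")

# ascii() keeps the output printable on any console.
print(ascii(text))
print(cp1252_error)
```

The point is only that the bytes are fine; it is the choice of codec at read time that breaks.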

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change Python's default encoding as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could install Cygwin and run Python under it to get clean UTF-8 output.
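As a rough illustration of the "get bytes and decode them yourself" option, here is a sketch that runs a command, captures its raw output, and decodes it explicitly as UTF-8. `run_and_decode` is a hypothetical helper, not part of pytesseract, and the demo command is just a stand-in so the sketch runs without Tesseract installed.

```python
import subprocess
import sys

def run_and_decode(cmd):
    """Run a command and decode its stdout explicitly as UTF-8 (hypothetical helper)."""
    completed = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
    # Decode the bytes ourselves instead of relying on the platform default.
    return completed.stdout.decode("utf-8").strip()

# Stand-in for an OCR command: a child Python process that emits UTF-8 bytes
# containing 0x9d (a curly closing quote followed by "ok").
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'\\xe2\\x80\\x9dok\\n')",
]
demo_text = run_and_decode(demo_cmd)
print(ascii(demo_text))
```

With Tesseract itself this might look like `run_and_decode(["tesseract", "binarized_image.png", "stdout"])`, assuming the `tesseract` executable is on your PATH and your version accepts `stdout` as the output base; I have not verified that against your install.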

If you want a quick and dirty solution that won't break anything (yet), here's a way that you might consider:

import builtins

original_open = open
def bin_open(filename, mode='rb'):       # note, the default mode now opens in binary
    return original_open(filename, mode)

from PIL import Image
import pytesseract

img = Image.open('binarized_image.png')

try:
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    builtins.open = original_open

print(str(bts, 'cp1252', 'ignore'))
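One caveat with the last line: the bytes are UTF-8, so decoding them as CP1252 can turn non-ASCII characters into mojibake. A sketch of an alternative, assuming the output really is UTF-8, is to decode as UTF-8 and only escape what the console can't show at print time. Both helper names here are hypothetical.

```python
import sys

def decode_ocr_bytes(raw):
    """Decode OCR output as UTF-8, marking any invalid bytes instead of failing."""
    return raw.decode("utf-8", errors="replace")

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Round-trip through the console encoding; unsupported characters
    # become backslash escapes such as \u4e2d rather than raising.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

sample = decode_ocr_bytes(b"caf\xc3\xa9 \xe2\x80\x9d")
print(console_safe(sample))
```

In the snippet above you would then use `print(console_safe(decode_ocr_bytes(bts)))` in place of the CP1252 decode.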
randomusername
  • Seems like there's some good information potentially related to this answer [here](http://stackoverflow.com/questions/18729148/unicode-characters-not-rendering-with-pil-imagefont). – MPlanchard Dec 15 '15 at 15:50
  • Yeah, that sounds like the issue I've run into before. This answer would be better if you gave some code explaining how to configure PyTesseract to open that file with a UTF8 encoding, if possible – Benjamin Hodgson Dec 15 '15 at 15:51
  • @BenjaminHodgson PyTesseract doesn't have a way to specify the encoding, but we can inject our own `open` alternative... – randomusername Dec 15 '15 at 16:19
  • @randomusername does your solution have an impact on the fidelity of the extracted text? I am getting a lot of strange characters, whereas the original document is plain English characters, even if it is a little bit blurred. An example is iÃŽc1-zo1sîâzzaïzÃœl VE0Ã2ÃŽE BP797Z5SiÃŽc1-zo1sîâzzaïzÃœl VE0Ã2ÃŽE BP797Z5S – Nwawel A Iroume Dec 16 '15 at 08:32
  • @NwawelAIroume no, but it does have a severe impact on the resulting output. Try printing the output as the original `bytes` object to see if you can salvage what you can. Or you can store the output in a file and use a UTF-8 capable text editor to view it. – randomusername Dec 16 '15 at 15:30

I had the same problem, but I needed to save the output of pytesseract to a file. So I wrote an OCR function around pytesseract and passed encoding='utf-8' when opening the output file. My function now looks like this:

import pytesseract

def image_ocr(image_path, output_txt_file_name):
    # OCR the image, then write the text out explicitly as UTF-8
    image_text = pytesseract.image_to_string(image_path, lang='eng+ces', config='--psm 1')
    with open(output_txt_file_name, 'w+', encoding='utf-8') as f:
        f.write(image_text)

I hope this helps someone :)

Novak