
I used the pytesseract module for OCR, but it is a slow process. So I followed the question "Pytesseract is too slow. How can I make it process images faster?".

I used the code mentioned in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/xvTFjYCDRQU/rCEwjZL3BQAJ , but I got this error: !strcmp(locale, "C"):Error:Assert failed:in file baseapi.cpp, line 201 Segmentation fault (core dumped). I then checked some posts and found a suggestion to add locale.setlocale(locale.LC_ALL, "C") to my code.

After adding this to my code, I got another error:

Traceback (most recent call last):
  File "master_doc_test3.py", line 107, in <module>
    tess = Tesseract()
  File "master_doc_test3.py", line 67, in __init__
    if self._lib.TessBaseAPIInit3(self._api, datapath, language):
ctypes.ArgumentError: argument 3: <class 'TypeError'>: wrong type

Can anyone explain this error? Or does anyone know the fastest way to do OCR in Python?

Rajesh das

1 Answer


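Try converting every string parameter you pass to ctypes library calls to bytes. In Python 3, str is Unicode text, and ctypes rejects it with "wrong type" where the C function expects a char * argument. So instead of: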

self._lib.TessBaseAPIInit3(self._api, datapath, language)

Something like this works for me:

self._lib.TessBaseAPIInit3(self._api, bytes(datapath, encoding='utf-8'), bytes(language, encoding='utf-8'))

I found the clue here. Please note that the code you are using needs the same adjustment in other library calls, such as these:

tess.set_variable(bytes("tessedit_pageseg_mode", encoding='utf-8'), bytes(str(frame_piece.psm), encoding='utf-8'))
tess.set_variable(bytes("preserve_interword_spaces", encoding='utf-8'), bytes(str(1), encoding='utf-8'))
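To avoid repeating bytes(..., encoding='utf-8') at every call site, the conversion can be wrapped in a small helper. This is only a minimal sketch: to_bytes and accepts_as_char_p are hypothetical names that are not part of the original code, and the check relies on ctypes.c_char_p.from_param, which applies the same argument conversion that ctypes performs for a char * parameter.

```python
import ctypes


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes (hypothetical helper, not from the original code).

    bytes pass through unchanged; anything else is converted with str()
    and then encoded, so integers such as a page segmentation mode work too.
    """
    if isinstance(value, bytes):
        return value
    return str(value).encode(encoding)


def accepts_as_char_p(value):
    """Report whether ctypes would accept value for a char * parameter."""
    try:
        ctypes.c_char_p.from_param(value)
        return True
    except TypeError:
        # Same "wrong type" failure that ctypes wraps in ArgumentError.
        return False


if __name__ == "__main__":
    print(accepts_as_char_p("eng"))            # str is rejected in Python 3
    print(accepts_as_char_p(to_bytes("eng")))  # bytes are accepted
```

With such a helper, the call from the question would read self._lib.TessBaseAPIInit3(self._api, to_bytes(datapath), to_bytes(language)), and the set_variable calls could pass to_bytes("tessedit_pageseg_mode") and to_bytes(frame_piece.psm) in the same way.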
derwyddon