I'm writing a Python script that uses the multiprocessing library to launch multiple Tesseract instances in parallel. When I make several calls to Tesseract in sequence in a loop, it works. However, when I try to parallelize the code, everything looks fine but I get no results (I waited 10 minutes).

In my code, I try to OCR multiple PDF pages after splitting them out of the original multi-page PDF.

Here's my code:

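import subprocess
from multiprocessing import Pool

def processPage(i):
    # Input image and output base name for page index i
    nameJPG = "converted-" + str(i) + ".jpg"
    nameHocr = "converted-" + str(i)
    # Run tesseract on the page image, producing hOCR output
    subprocess.check_call(["tesseract", nameJPG, nameHocr, "-l", "eng", "hocr"])
    print("tesseract did the job for page " + str(i + 1))

pool1 = Pool(4)
pool1.map(processPage, range(len(pdf.pages)))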
hamma

2 Answers

As far as I know, pytesseract does not cope well with multiple processes. If you have a quad-core machine and run 4 processes simultaneously, Tesseract will choke, with high CPU usage and other problems. If you need this for a company and don't want to use the Google Vision API, you have to set up multiple servers and use socket programming to request text from them, so that the number of parallel processes on each server stays below what it can run at the same time; for a quad-core machine that would be 2 or 3. Otherwise you can use the Google Vision API: they have a lot of servers and their output is quite good too.

Disabling multithreading in Tesseract will also help. It can be done by setting OMP_THREAD_LIMIT=1 in the environment. Even so, you should not run multiple Tesseract processes on the same server.

See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
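As a rough sketch of the OMP_THREAD_LIMIT suggestion applied to the script in the question: the environment variable is passed to each tesseract subprocess, and the pool is kept smaller than the core count. The helper names (`tesseract_env`, `tesseract_command`, `process_page`, `ocr_pages`) and the default of 2 workers are assumptions made for illustration, not part of the original code, and whether this resolves the hang on a given machine is untested here.

```python
import os
import subprocess
from multiprocessing import Pool


def tesseract_env():
    """Copy the current environment and cap Tesseract's OpenMP threads at one."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def tesseract_command(i):
    """Build the same command line the question uses for page index i."""
    base = "converted-" + str(i)
    return ["tesseract", base + ".jpg", base, "-l", "eng", "hocr"]


def process_page(i):
    # Each worker starts one tesseract process limited to a single thread.
    subprocess.check_call(tesseract_command(i), env=tesseract_env())
    return i


def ocr_pages(page_count, workers=2):
    # workers=2 is an assumed value; keep it below the number of cores.
    pool = Pool(workers)
    try:
        return pool.map(process_page, range(page_count))
    finally:
        pool.close()
        pool.join()
```

With the question's variables, this would be called as `ocr_pages(len(pdf.pages))`, ideally from under an `if __name__ == "__main__":` guard so the worker processes can import the module safely.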

Dharman
vsnu

Your code is launching a Pool and exiting before it finishes its job. You need to call close and join.

pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
pool1.close()
pool1.join()

Alternatively, you can wait for its results.

pool1=Pool(4)
print pool1.map(processPage, range(len(pdf.pages)))
noxdafox
  • Nope, doesn't work either. In fact, the problem is not with the process closing; the problem is with Tesseract itself: even when I launch it alone, my CPU runs at 300%. Normally a PDF page takes 10 s. Now it keeps running without stopping. – hamma Jun 21 '17 at 08:31
  • Have you tried to call it via `subprocess` itself without the process `Pool`? – noxdafox Jun 21 '17 at 12:28
  • I don't think there's any solution to my problem: I launched 2 Tesseract instances (extracting text from a PDF) at the same time, and even from 2 separate terminals I'm not getting any results. – hamma Jun 21 '17 at 12:33
  • Tesseract supports multithreading and multiprocessing and I've been using it extensively in multiple processes. Therefore, there might be some problem in the way you execute it or in your environment. – noxdafox Jun 21 '17 at 13:33
  • I'm using tesseract 4 and ubuntu 16.04 – hamma Jun 21 '17 at 13:35
  • I am facing the _exact_ same problem as @hamma. The code works well in windows but takes a hell lot of time in Ubuntu. Did you use tesseract in windows? If not, then can you take a look at [this](https://stackoverflow.com/questions/53468446/pytesseract-call-working-very-slow-when-used-along-with-multiprocessing). – Mooncrater Dec 16 '18 at 14:01