I have here some lines of code from the beginning of my OCR program. I can see with the Time() function that these few lines take 90% of the runtime. Unfortunately, I have no further ideas for how to make these lines more time-efficient. What approaches would you take to speed this up?

    for page_number, page_data in enumerate(doc):
        txt = pytesseract.image_to_string(page_data, lang='eng').encode('utf-8')
        Counter = 0
        txt = txt.decode('utf-8')
        tokens = txt.split()

        for i in tokens:
            ResultpageNumber.append([page_number + 1, tokens[Counter], Counter])
            Counter = Counter + 1
Dean James
    Does this answer your question? [Pytesseract is very slow for real time OCR, any way to optimise my code?](https://stackoverflow.com/questions/66334737/pytesseract-is-very-slow-for-real-time-ocr-any-way-to-optimise-my-code) – CreepyRaccoon Nov 27 '22 at 23:28
  • Also, it seems you're unnecessarily encoding and decoding the *string*; the `append()` method is also slow, and you could use `range()` in your 2nd loop instead of a counter. For the rest, the wrapper is to blame... – CreepyRaccoon Nov 27 '22 at 23:34
  • Can you please show me what you mean by the range method? and what would be the alternative to append? – Dean James Nov 28 '22 at 07:52
  • By `range`, I mean: `for i in range(len(tokens)): ResultpageNumber.append([page_number + 1, tokens[i], i])`; with this you may remove `Counter` (see the sketch after these comments). – CreepyRaccoon Nov 28 '22 at 09:16
  • Thanks, but it currently makes the code slower. – Dean James Nov 28 '22 at 09:25
  • It should not; look at this: https://stackoverflow.com/a/869295/18342123 – CreepyRaccoon Nov 28 '22 at 09:31
  • W.r.t. the `append()` method, it's trickier, but there are better approaches, e.g.: https://stackoverflow.com/a/311783/18342123 – CreepyRaccoon Nov 28 '22 at 09:31
  • But if I put your lines into the code, it unfortunately gets slower... – Dean James Nov 28 '22 at 09:36
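
Putting the comment suggestions together, a minimal sketch of the loop without the encode/decode round trip and without the manual counter might look like this. Using `enumerate` instead of the suggested `range(len(tokens))` is an equivalent editorial choice, and `doc` and `ResultpageNumber` are assumed to be defined exactly as in the question:

    import pytesseract

    for page_number, page_data in enumerate(doc):
        # image_to_string() already returns a str, so no encode/decode round trip
        txt = pytesseract.image_to_string(page_data, lang='eng')
        # enumerate() replaces the manual Counter variable
        for index, token in enumerate(txt.split()):
            ResultpageNumber.append([page_number + 1, token, index])

These changes only trim Python-level overhead; as the answer below points out, the OCR call itself is where nearly all of the time goes.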

1 Answer


You're saying that `.image_to_string()` consumes most of the CPU cycles.

Yup. That's not surprising; it's a hard problem we're asking it to solve.

Delve into what that function is doing if you want to shave off some seconds of CPU time, but you're probably better off consulting the fine documentation.

Depending on your source images, some preprocessing to binarize or to reduce resolution might give the engine a slightly easier problem; it's hard to say. There's no magic bullet.
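
For illustration only, a minimal preprocessing sketch along those lines is below. It assumes `page_data` is a `PIL.Image` (the question doesn't show how `doc` is built), and the `ocr_page` helper, the `scale` factor, and the fixed `threshold` are made-up names and values, not part of the original code:

    import pytesseract

    def ocr_page(page_data, scale=0.5, threshold=128):
        """Hypothetical helper: shrink and binarize a page before handing it to Tesseract."""
        img = page_data.convert('L')                      # grayscale
        if scale != 1.0:                                  # reduce resolution
            img = img.resize((int(img.width * scale), int(img.height * scale)))
        img = img.point(lambda p: 255 if p > threshold else 0)   # crude binarization
        return pytesseract.image_to_string(img, lang='eng')

Whether this helps at all depends on the source material, so time both variants on a few representative pages before committing to either.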

J_H