I have here some lines of code from the beginning of my OCR program. I can see with the Time() function that these few lines take 90% of the runtime. Unfortunately, I have no further ideas for how to make these lines more time-efficient. What approaches would you take to speed this up?

    for page_number, page_data in enumerate(doc):
        txt = pytesseract.image_to_string(page_data, lang='eng').encode('utf-8')
        Counter = 0
        txt = txt.decode('utf-8')
        tokens = txt.split()

        for i in tokens:
            ResultpageNumber.append([page_number + 1, tokens[Counter], Counter])
            Counter = Counter + 1
Dean James
    Does this answer your question? [Pytesseract is very slow for real time OCR, any way to optimise my code?](https://stackoverflow.com/questions/66334737/pytesseract-is-very-slow-for-real-time-ocr-any-way-to-optimise-my-code) – CreepyRaccoon Nov 27 '22 at 23:28
  • Also, it seems you're unnecessarily encoding and decoding the *string*; the `append()` method is also slow, and you could use `range()` in your 2nd loop instead of a counter. For the rest, the wrapper is to blame... – CreepyRaccoon Nov 27 '22 at 23:34
  • Can you please show me what you mean by the range method? and what would be the alternative to append? – Dean James Nov 28 '22 at 07:52
  • By `range`, I mean: `for i in range(len(tokens)): ResultpageNumber.append([page_number + 1, tokens[i], i])`; with this you may remove `Counter` (see the sketch after these comments). – CreepyRaccoon Nov 28 '22 at 09:16
  • Thanks, but it currently makes the code slower. – Dean James Nov 28 '22 at 09:25
  • It should not; look at this: https://stackoverflow.com/a/869295/18342123 – CreepyRaccoon Nov 28 '22 at 09:31
  • W.r.t. the `append()` method, it's trickier, but there are better approaches, e.g.: https://stackoverflow.com/a/311783/18342123 – CreepyRaccoon Nov 28 '22 at 09:31
  • But if I put your lines into the code, it unfortunately gets slower... – Dean James Nov 28 '22 at 09:36
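
Putting the comment suggestions together, a minimal sketch of the loop without the encode/decode round trip and without the manual counter might look like this. Using `enumerate` instead of the suggested `range(len(tokens))` is an equivalent editorial choice, and `doc` and `ResultpageNumber` are assumed to be defined exactly as in the question:

    import pytesseract

    for page_number, page_data in enumerate(doc):
        # image_to_string() already returns a str, so no encode/decode round trip
        txt = pytesseract.image_to_string(page_data, lang='eng')
        # enumerate() replaces the manual Counter variable
        for index, token in enumerate(txt.split()):
            ResultpageNumber.append([page_number + 1, token, index])

These changes only trim Python-level overhead; as the answer below points out, the OCR call itself is where nearly all of the time goes.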

1 Answer


You're saying that `.image_to_string()` consumes most of the CPU cycles.

Yup. That's not surprising; it's a hard problem we're asking it to solve.

Delve into what that function is doing if you want to shave off some seconds of CPU time, but you're probably better off consulting the fine documentation.

Depending on your source images, some preprocessing to binarize or to reduce resolution might give the engine a slightly easier problem; it's hard to say. There's no magic bullet.
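
For illustration only, a minimal preprocessing sketch along those lines is below. It assumes `page_data` is a `PIL.Image` (the question doesn't show how `doc` is built), and the `ocr_page` helper, the `scale` factor, and the fixed `threshold` are made-up names and values, not part of the original code:

    import pytesseract

    def ocr_page(page_data, scale=0.5, threshold=128):
        """Hypothetical helper: shrink and binarize a page before handing it to Tesseract."""
        img = page_data.convert('L')                      # grayscale
        if scale != 1.0:                                  # reduce resolution
            img = img.resize((int(img.width * scale), int(img.height * scale)))
        img = img.point(lambda p: 255 if p > threshold else 0)   # crude binarization
        return pytesseract.image_to_string(img, lang='eng')

Whether this helps at all depends on the source material, so time both variants on a few representative pages before committing to either.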

J_H