I am using pytesseract to recognize text as follows:

import pytesseract
from pytesseract import Output

# img is the page image, loaded beforehand
td = pytesseract.image_to_data(img, output_type=Output.DICT)
# print every recognized text entry
for text in td['text']:
    print(text)

I am building an index of examples with a simple rule: detect the keyword 'Example no.', find its end-point keyword 'Sol.', and put the piece of the image from 'Example no.' to 'Sol.' into the index; then find the next example, and so on.
But when I try the following image, which has a horizontal line above the text, the output is: SET THEORY ae . . 5 (6) Let A = {x: x isa negative odd integer} = {-1,-3,-5,-7,...etc
Notice that it does not recognize the first line: Sol. (a) Let A={x:x is a natural number..etc.
When I try the following image, which has no horizontal line, it works fine.
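The indexing rule described above can be sketched roughly as follows. This is only an illustration, not my actual code: the function name `find_example_regions` is made up, it assumes Tesseract returns 'Example' and 'Sol.' as separate words, and it only uses the 'text' and 'top' lists of the dict returned by image_to_data.

```python
def find_example_regions(data, start_word="Example", end_word="Sol."):
    """Return (top, bottom) pixel rows for each start_word ... end_word span.

    `data` is a dict shaped like pytesseract.image_to_data(..., output_type=Output.DICT),
    of which only the 'text' and 'top' lists are used here.
    """
    regions = []
    start_top = None  # top edge of the currently open example, if any
    for word, top in zip(data["text"], data["top"]):
        word = word.strip()
        if start_top is None and word == start_word:
            start_top = top                    # an example begins here
        elif start_top is not None and word == end_word:
            regions.append((start_top, top))   # the example ends where 'Sol.' starts
            start_top = None
    return regions
```

Each (top, bottom) pair can then be used to cut a full-width strip out of the page, for instance `img[top:bottom, :]` if img is a NumPy array. If the OCR misses a keyword, as in the cases below, that example is silently dropped from the result.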

Is there any way to configure pytesseract to recognize text that has a line above it?

Edited:

Sometimes, when an image or some larger text is placed above a piece of text, pytesseract fails to detect the text below that bigger object.

Is there a solution for this kind of problem? Maybe there is a way to configure a minimum detection size, or to detect text of all sizes, even under a bigger object?

For example, for the following image the output is: usually denoted by o(G). ors a a {= 7 Wave =e () oe that the set of ae | group usual ition of integers.
Notice that it does not detect the keyword Example 1.

But when I try the following image, the output is: usually denoted by o(G). Example 1. (2) Prove that th . group under usual addition of integers, Now it detects the keyword Example 1.

Navpreet Devpuri
  • What about automatically removing the black line? You can easily detect it based on its size (almost the whole width) and position (just above the Sol. text). You can even use it to undistort the text, but that's another topic ;-) – antoine Jun 09 '20 at 12:22
  • Thank you for the solution, I will try it. But sometimes, when an image or some larger text is placed above a piece of text, pytesseract fails to detect the text below that bigger object. Can you suggest a solution for this kind of problem? Maybe there is a way to configure a minimum detection size, or to detect text of all sizes, even under a bigger object. – Navpreet Devpuri Jun 09 '20 at 15:42
  • I submitted an issue: https://github.com/tesseract-ocr/tesseract/issues/3011 – Navpreet Devpuri Jun 09 '20 at 17:07
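The line-removal idea from the first comment could look something like this. It is a minimal sketch under assumptions that may not hold for every scan: the image is an 8-bit grayscale NumPy array with dark text on a light background, and the rule is straight and spans most of the width. The function name and both threshold parameters are made up for illustration.

```python
import numpy as np

def remove_horizontal_rules(gray, dark_threshold=128, min_width_fraction=0.6):
    """Whiten every pixel row that is mostly dark, i.e. looks like a printed rule.

    gray: 2-D uint8 array (grayscale page, dark ink on light paper).
    dark_threshold: pixels below this value count as dark (assumed value).
    min_width_fraction: share of the row that must be dark (assumed value).
    """
    gray = np.asarray(gray)
    dark = gray < dark_threshold
    # Fraction of dark pixels in each row; ordinary text rows stay well below the cutoff.
    rule_rows = dark.mean(axis=1) >= min_width_fraction
    cleaned = gray.copy()      # leave the caller's array untouched
    cleaned[rule_rows] = 255   # paint the detected rule rows white
    return cleaned
```

The cleaned array can be passed to pytesseract.image_to_data in place of the original image. A skewed or curved line would not be caught by a per-row test like this, so a warped page would need dewarping first.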

2 Answers

Read, for example, "image processing to improve tesseract OCR accuracy", and read the docs.

user898678
  • I found a better dewarper, [ocrd_cis](https://github.com/cisocrgroup/ocrd_cis), but for now I don't know how to use it. Also, when the given image is scaled up by a factor of 3, it detects the keyword `Example 1.` The question now is how to find the scale factor that gives the best results; I asked that question [here](https://stackoverflow.com/questions/62480172/tesseract-ocr-act-weird-while-scalling-up-image-size-how-to-know-which-scale-fa) – Navpreet Devpuri Jun 20 '20 at 00:46
  • I want the best results; what should I try? Is there any way to configure the minimum and maximum font size? – Navpreet Devpuri Jun 20 '20 at 00:48
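One possible way to approach the scale-factor question from the comments is to try a few factors and keep the one with the highest average word confidence. This is a heuristic sketch, not something Tesseract documents as a selection method. The names `mean_confidence` and `best_scale` are made up; `image` is assumed to be a PIL image, and `ocr` is any callable that returns the image_to_data dict.

```python
def mean_confidence(data):
    """Average confidence over real words in an image_to_data dict.

    Rows with conf -1 (blocks, lines, paragraphs) and empty text are skipped.
    float() is used because conf may arrive as int, float, or str.
    """
    confs = [float(c) for c, t in zip(data["conf"], data["text"])
             if float(c) >= 0 and t.strip()]
    return sum(confs) / len(confs) if confs else 0.0

def best_scale(image, scales, ocr):
    """Return (scale, score) for the scale whose OCR result scores highest."""
    results = []
    for s in scales:
        resized = image.resize((int(image.width * s), int(image.height * s)))
        results.append((mean_confidence(ocr(resized)), s))
    score, scale = max(results)  # ties are broken in favor of the larger scale
    return scale, score
```

With pytesseract this might be called as `best_scale(img, [1, 2, 3], lambda im: pytesseract.image_to_data(im, output_type=Output.DICT))`. A high average confidence does not guarantee that a particular keyword such as `Example 1.` was found, so checking for the keyword directly may be a better score for this use case.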

You can try dewarping the image. I used this repo: dewarp-github
The code is written for Python 2. If you are using Python 3+, you can convert it with 2to3. It needed some modifications for my case, which were not too complex to handle.

Beginner