
I have Tesseract OCR output in a DataFrame. One of the columns holds Tesseract's confidence score for each recognized word. By default, though, Tesseract struggles with handwritten text. My plan is to use the bounding-box info that Tesseract also provides to capture the box around low-confidence words, check whether they are handwritten, and use a different model to classify what the handwritten text says. So the DataFrame for an OCR of "Hello world, I am Zac!", where "I am Zac!" is handwritten, may look like:

         conf    top    left    width    height    text
1        90.0    100     50     67       14        Hello
2        92.0    100     60     67       14        world,
3        54.0    100     65     21       13        l
4        32.0    100     67     29       12        @n;
5         0.0    100     71     37       14        2ao!
6        90.0    100     77     36       12        text
...

My actual data has a lot of rows (easily over a thousand per image, with dozens of images), so I suspect .iterrows() would be inefficient, though I am not sure. What I need to select are all runs of consecutive rows with conf < 60, as long as a run contains more than one row. I also need each run as a separate group, because I'm going to use the top, left, width, and height values to find the words' bounding boxes, merge those boxes into one large box that covers all of them, and pass that new box to a model to predict the handwriting. So from the example, I would want to select:

         conf    top    left    width    height    text
3        54.0    100     65     21       13        l
4        32.0    100     67     29       12        @n;
5         0.0    100     71     37       14        2ao!

I can select all rows where conf is below my threshold, but I am wondering if there is a more efficient way to select consecutive rows that match this criterion. If I just select every low-confidence row, I then have to iterate to find the groups with at least 2 consecutive low-confidence rows, and iterate again over those groups to get their box info. Doing this for each page of each document seems computationally taxing.

Any advice or suggestions related to the problem (even if the advice is a better way to do what I'm trying to do) would be greatly appreciated.

Z. Shaffer

1 Answer


It sounds like your notion of consecutive rows refers to words/tokens that appear next to each other in the original input. I'd suggest applying the high-level filter (conf < 60) first to reduce the search space, then following something like "Detecting consecutive integers in a list" to find which of the remaining indices are consecutive. Once you have those indices, use them to filter your DataFrame.
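As a rough sketch of one way to do this without .iterrows(): the snippet below labels each run of consecutive rows by counting how often the "low confidence" flag changes, keeps only the low-confidence runs with at least two rows, and then merges each run's boxes with a groupby. The helper names (low_conf_runs, merged_boxes) and the run_id column are made up for illustration, and it assumes the rows are already in reading order with the column names from the question. If your frame still contains Tesseract's structural rows (which I believe carry a conf of -1), you would probably want to drop those first.

```python
import pandas as pd

def low_conf_runs(df, threshold=60.0, min_len=2):
    """Return low-confidence rows that sit in runs of at least min_len rows."""
    low = df["conf"] < threshold
    # The run counter increases every time the flag flips between rows,
    # so consecutive rows with the same flag share one run_id.
    run_id = (low != low.shift()).cumsum()
    low_rows = df.assign(run_id=run_id)[low]
    # Drop isolated low-confidence rows (runs shorter than min_len).
    sizes = low_rows.groupby("run_id")["run_id"].transform("size")
    return low_rows[sizes >= min_len]

def merged_boxes(runs):
    """Compute one enclosing box per run (hypothetical helper)."""
    tmp = runs.assign(
        right=runs["left"] + runs["width"],
        bottom=runs["top"] + runs["height"],
    )
    boxes = tmp.groupby("run_id").agg(
        left=("left", "min"),
        top=("top", "min"),
        right=("right", "max"),
        bottom=("bottom", "max"),
        text=("text", lambda s: " ".join(s.astype(str))),
    )
    boxes["width"] = boxes["right"] - boxes["left"]
    boxes["height"] = boxes["bottom"] - boxes["top"]
    return boxes

# The example frame from the question.
df = pd.DataFrame(
    {
        "conf": [90.0, 92.0, 54.0, 32.0, 0.0, 90.0],
        "top": [100, 100, 100, 100, 100, 100],
        "left": [50, 60, 65, 67, 71, 77],
        "width": [67, 67, 21, 29, 37, 36],
        "height": [14, 14, 13, 12, 14, 12],
        "text": ["Hello", "world,", "l", "@n;", "2ao!", "text"],
    },
    index=[1, 2, 3, 4, 5, 6],
)

runs = low_conf_runs(df)    # rows 3, 4 and 5
boxes = merged_boxes(runs)  # one row per run, ready to crop from the image
```

Each row of boxes then gives a single left/top/width/height you can crop and hand to the handwriting model. Everything here is vectorized, so a few thousand rows per page should not be a problem.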

synaptikon