I have output from Tesseract OCR in a pandas DataFrame. One of the columns holds the confidence score Tesseract assigns to each individual word it recognizes. By default, though, Tesseract struggles with handwritten character recognition. My plan is to use the bounding box info that Tesseract also provides to capture the region around low-confidence words, check whether that region is handwritten, and pass it to a different model to classify what the handwritten text says. So for an image that reads "Hello world, I am Zac!", where "I am Zac!" is handwritten, my DataFrame may look like:
   conf  top  left  width  height  text
1  90.0  100    50     67      14  Hello
2  92.0  100    60     67      14  world,
3  54.0  100    65     21      13  l
4  32.0  100    67     29      12  @n;
5   0.0  100    71     37      14  2ao!
6  90.0  100    77     36      12  text
...
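For context, the DataFrame is produced with something along these lines (just a minimal sketch; the image path is a placeholder for one page of one document, and I keep only the columns shown above):

import pandas as pd
import pytesseract
from PIL import Image

# Load Tesseract's word-level TSV output straight into a DataFrame
img = Image.open("page_001.png")  # placeholder path
df = pytesseract.image_to_data(img, output_type=pytesseract.Output.DATAFRAME)

# Keep only rows where a word was actually recognized, and only the columns I care about
df = df.dropna(subset=["text"])
df = df[["conf", "top", "left", "width", "height", "text"]]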
Now my actual data has a lot of rows (easily over a thousand per image, across dozens of images), so using .iterrows() seems like it may be inefficient. What I need to select are all the consecutive rows with conf < 60, as long as the run contains more than one row. I also need each of these groups of consecutive low-confidence rows separately, because I'm going to use their top, left, width, and height values to get all of the words' bounding boxes, combine those boxes into one large box that covers all of the individual ones (see the sketch after the table below), and pass that new box to a model to predict the handwriting. So from the example, I would want to select:
   conf  top  left  width  height  text
3  54.0  100    65     21      13  l
4  32.0  100    67     29      12  @n;
5   0.0  100    71     37      14  2ao!
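Once I have one of those groups, combining its boxes would be something like this (a sketch, assuming group is a DataFrame holding one run of consecutive low-confidence rows, e.g. rows 3-5 above):

def merge_boxes(group):
    # group: one run of consecutive low-confidence word rows
    x0 = group["left"].min()
    y0 = group["top"].min()
    x1 = (group["left"] + group["width"]).max()
    y1 = (group["top"] + group["height"]).max()
    # returned as (left, top, width, height); for rows 3-5 above this gives (65, 100, 43, 14)
    return x0, y0, x1 - x0, y1 - y0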
I can select all rows where conf is lower than my threshold, but I am wondering whether there is a more effective way to select consecutive rows that match this criterion. If I simply select all of the low-confidence rows, I then have to iterate to find the groups of at least two consecutive low-confidence values, and then iterate through each of those groups to get their box info, and doing this for every page of every document seems computationally taxing.
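For completeness, what I can already do is just the flat threshold filter (60 is the cutoff I mentioned):

low_conf = df[df["conf"] < 60]

but that gives me every low-confidence row in one selection, with no notion of which rows are consecutive or which group each belongs to.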
Any advice or suggestions related to the problem (even if the advice is a better way to do what I'm trying to do) would be greatly appreciated.