How can a scanned page be divided into words like the reCaptcha project?

Question

I would like to digitize a book in a similar way to the reCaptcha project. Is there an existing system that takes a page image as input and outputs small images cropped around each word? Any ideas on how to do this?

score 0 · Answer 1 · answered Sep 27 '15 at 11:44


You should look into the Tesseract OCR project, on which reCaptcha was probably based. It can output the coordinates of each recognized word. You then crop the page to those coordinates and you are done.
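A minimal sketch of that approach, assuming the `pytesseract` wrapper and Pillow are installed alongside a Tesseract binary. The helper names `word_boxes` and `crop_words`, the `min_conf` threshold, and the output file naming are hypothetical choices for illustration. It relies on `pytesseract.image_to_data`, which reports `left`, `top`, `width`, `height`, `conf` and `text` for each detected element.

```python
def word_boxes(data, min_conf=0.0):
    """Turn image_to_data output (dict form) into (left, top, right, bottom) boxes.

    Entries with empty text (blocks, paragraphs, lines) or a confidence
    below min_conf are skipped. min_conf is an illustrative parameter.
    """
    boxes = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        # conf may arrive as a string or a number depending on the version
        if not text.strip() or float(conf) < min_conf:
            continue
        boxes.append((left, top, left + width, top + height))
    return boxes


def crop_words(image_path, out_prefix="word"):
    """Run Tesseract on one page and save one PNG per recognized word."""
    # Imported here so word_boxes can be used without these packages
    from PIL import Image
    import pytesseract

    page = Image.open(image_path)
    data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
    paths = []
    for i, box in enumerate(word_boxes(data)):
        path = f"{out_prefix}_{i:04d}.png"  # hypothetical naming scheme
        page.crop(box).save(path)
        paths.append(path)
    return paths
```

Calling `crop_words("page.png")` on a scanned page (the file name is a placeholder) would write `word_0000.png`, `word_0001.png`, and so on. Raising `min_conf` drops words Tesseract is unsure about, which may or may not be what you want for a reCaptcha-like workflow.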


beppe9000


score 0 · Answer 2 · edited May 23 '17 at 10:26

If you just want to split the image into multiple images, one word each, you could find the word bounding boxes and use those coordinates for the splitting. This can be done by taking histograms/projections of the document in the horizontal direction to find the text lines, and then, for each line, in the vertical direction to find the words. An example algorithm with some pictures describing the idea can be found in this paper: "Document Page Decomposition by the Bounding-Box Projection Technique" (http://haralick.org/conferences/71281119.pdf). You could implement this in OpenCV.
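To illustrate the projection idea, here is a rough sketch using plain pixel projection profiles with NumPy. It is not the exact method from the paper, which projects bounding boxes of connected components. The function names `runs` and `word_boxes_by_projection` and the `word_gap` threshold are assumptions made for this example, and the input is assumed to be an already binarized, deskewed page.

```python
import numpy as np


def runs(mask, min_gap=1):
    """Return (start, end) spans of True values in a 1-D boolean array.

    Spans separated by fewer than min_gap False entries are merged,
    so small gaps between letters do not split a word.
    """
    spans, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(mask)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def word_boxes_by_projection(ink, word_gap=3):
    """ink: 2-D boolean array, True where a pixel is dark.

    Returns (left, top, right, bottom) boxes, line by line.
    word_gap is the smallest blank width (in pixels) treated as a
    word boundary; it is a tuning assumption, not a fixed rule.
    """
    boxes = []
    # Project onto the vertical axis: rows containing ink form text lines
    for top, bottom in runs(ink.any(axis=1)):
        line = ink[top:bottom]
        # Project the line onto the horizontal axis: columns with ink form words
        for left, right in runs(line.any(axis=0), min_gap=word_gap):
            boxes.append((left, top, right, bottom))
    return boxes
```

For a real scan, something like `ink = np.asarray(Image.open("page.png").convert("L")) < 128` (Pillow, with a placeholder file name and a fixed threshold) would produce the boolean array. A sensible `word_gap` depends on the scan resolution and font, so expect to tune it per book.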

Alternatively, you can use Tesseract as mentioned by beppe9000. Perhaps this helps: "Getting the bounding box of the recognized words using python-tesseract".

But then you take on the whole complexity of training OCR even though you only want the bounding boxes.
