I am creating a database from historical records which I have as photographed pages from books (+100K pages). I wrote some python code to do some image processing before I OCR each page. Since the data in these books does not come in well formatted tables, I need to segment each page into rows and columns and then OCR each piece separately.
One of the critical steps is to align the text in the image.
For example, this is a typical page that needs to be aligned:
A solution I found is to smudge the text horizontally (I'm using skimage.ndimage.morphology.binary_dilation) and find the rotation that maximizes the sum of white pixels along the horizontal dimension.
This works fine, but it takes about 8 seconds per page, which given the volume of pages I am working with, is way too much.
Do you know of a better, faster way of accomplishing aligning the text?
Update:
I use scikit-image for image processing functions, and scipy to maximize the count of white pixels along the horizontal axis.
Here is a link to an html view of the Jupyter notebook I used to work on this. The code uses some functions from a module I've written for this project so it cannot be run on its own.
Link to notebook (dropbox): https://db.tt/Mls9Tk8s
Update 2:
Here is a link to the original raw image (dropbox): https://db.tt/1t9kAt0z