If you are interested in OCR, check out Google's Tesseract. The project is open source and, according to Google, it is "probably the most accurate open source OCR engine available".
For more details on Tesseract and the algorithms it uses, refer here.
How good is Tesseract on Scanned Pages?
I used Tesseract to extract text from this scanned image, using the English language training set they provided.
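For anyone who wants to try a similar run, here is a minimal sketch in Python that shells out to the `tesseract` command-line tool. It assumes `tesseract` is installed and on your PATH with the English (`eng`) language data available. The function names and the file name `scan.png` are hypothetical, made up for illustration, and this is not necessarily how the run above was done.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for a Tesseract CLI run that prints to stdout.

    Hypothetical helper for illustration; only the CLI arguments themselves
    (image, output base, -l language) come from Tesseract.
    """
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Assumes the tesseract binary is installed; raises CalledProcessError
    if tesseract exits with a non-zero status.
    """
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr_image("scan.png")` would then return the recognized text as a single string, which you can print or save for comparison with the original page.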
This is what the output looked like:
2213 (rout wan w. suns)
HERE dwell rogether still two men of note Who never lived and so can
never die: How very near they seem, ye: how remote Tm age berm me
world went all awry. But still the game’: afoot for rhose with ears
Avtuned to catch the distant View-halloo: England is England yer, for
all our fenrs— Only those lhlngs the heart ézlin/ex are true.
A yellow fog swirls pm the window-pane
A: night descends upon lhls fabled street:
A lonely hansom splashes through the rain,
The ghostly gas lamps ran at (Wenly feet.
Here, though the world explode, these two survive, And it is always
eighteen ninety-five.
MAW. H‘ .9“ Vmczwr Snuuus-rr
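One way to put a number on "how good" is the character error rate: the edit distance between the OCR output and the correct text, divided by the length of the correct text. The sketch below is a plain Levenshtein implementation applied to one line from the output above. The reference string is my assumed correct reading of that line, not something taken from the scan, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, reference_text):
    """Edit distance normalized by the length of the reference text."""
    if not reference_text:
        raise ValueError("reference_text must not be empty")
    return edit_distance(ocr_text, reference_text) / len(reference_text)


# One line of the Tesseract output above, and an ASSUMED correct reading of it.
ocr_line = "A: night descends upon lhls fabled street:"
reference_line = "As night descends upon this fabled street:"

print(edit_distance(ocr_line, reference_line))
print(round(character_error_rate(ocr_line, reference_line), 3))
```

Running the same comparison over the whole page, against a carefully typed transcription, would give a single accuracy figure that can be compared across scans or OCR engines.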