
At my work, I sometimes have to take some printed source code and manually type the source code into a text editor. Do not ask why.

Obviously typing it up takes a long time, plus extra time to debug the typing errors (oops, missed a "$" sign there).

I decided to try some OCR solutions like:

  • Microsoft Document Imaging - has built in OCR
    • Result: Missed all the leading whitespace, missed all the underscores, interpreted many of the punctuation characters incorrectly.
    • Conclusion: Slower than manually typing in code.
  • Various online web OCR apps
    • Result: Similar or worse than Microsoft Document Imaging
    • Conclusion: Slower than manually typing in code.

I feel like source code should be very easy to OCR, given that the font is sans-serif and monospace.

Have any of you found a good OCR solution that works well on source code?

Maybe I just need a better OCR solution (not necessarily source code specific)?

Trevor Boyd Smith

7 Answers


With OCR, there are currently three options:

  • ABBYY FineReader and OmniPage. Both are commercial products which are about on par when it comes to features and OCR results. I can't say much about OmniPage, but FineReader does come with support for reading source code (for example, it has a Java language library).
  • The best OSS OCR engine is Tesseract. It's much harder to use; you'll probably need to train it for your language.
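For reference, a typical Tesseract invocation looks like the sketch below. This is an assumption about a standard setup, not something from the thread: it presumes the `tesseract` binary is installed and the English traineddata file is in the tessdata directory (or pointed to by `TESSDATA_PREFIX`), which is exactly the setup step the comments below struggle with.

```shell
# Hypothetical invocation sketch (recent Tesseract versions).
# Assumes eng.traineddata is installed under the tessdata directory,
# e.g. /usr/share/tesseract-ocr/tessdata, or set explicitly:
#   export TESSDATA_PREFIX=/path/to/tessdata
tesseract scanned_page.png output_base --psm 6
# output_base.txt now contains the recognized text; page segmentation
# mode 6 ("a single uniform block of text") tends to suit code listings.
```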

I rarely do OCR, but I've found that spending the $150 on the commercial software far outweighs the wasted time.

Aaron Digulla
  • I tried tesseract. It failed when I first downloaded it. The online readme specifies that it doesn't come with any training data, so I downloaded the English training data from the website and untarred it into the tessdata subdir. BUT then it still complained about "could not find eng.unicharset". How am I messing this up? – Trevor Boyd Smith Dec 11 '09 at 16:15
  • See what I mean? Tesseract is only free if your time costs nothing. But you can post questions in the tesseract user group. They are friendly there and your input will help to make it easier for the next person to set this beast up. – Aaron Digulla Dec 12 '09 at 12:41
  • @Aaron Digulla, can you suggest some OCR libraries in the $150 to $500 range? – Sajjad Ali Khan Dec 30 '15 at 11:41
  • @Sajjad I don't know any. – Aaron Digulla Jan 12 '16 at 14:01
  • Would like to point out that without training, *tesseract* does nothing different from a regular OCR: it will ignore all the leading whitespace and miss all the underscores. However, it is also __difficult to train__, because you need to spend time labelling each sample. – Luk Aron Oct 28 '19 at 02:39

Two new options exist today (years after the question was asked):

1.)

Windows 10 comes with an OCR engine from Microsoft.

It is in the namespace:

Windows.Media.Ocr.OcrEngine

https://msdn.microsoft.com/en-us/library/windows/apps/windows.media.ocr

There is also an example on Github:

https://github.com/Microsoft/Windows-universal-samples/tree/master/Samples/OCR

You need VS2015 to compile this stuff. If you want to use an older version of Visual Studio, you must invoke it via traditional COM; see this article on CodeProject: http://www.codeproject.com/Articles/262151/Visual-Cplusplus-and-WinRT-Metro-Some-fundamentals

The OCR quality is very good. Nevertheless, if the text is too small you must upscale the image first. You can download every language that exists in the world via Windows Update - even for handwriting!
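The upscale-and-grayscale preprocessing described here can be sketched in a few lines of Python. This assumes the third-party Pillow library; the function name and the scale factor are illustrative, not anything the answer prescribes:

```python
from PIL import Image


def prepare_for_ocr(path, scale=3):
    """Upscale a small screenshot and convert it to grayscale,
    which typically improves OCR accuracy on small text.
    (Sketch only; scale factor is an assumption.)"""
    img = Image.open(path).convert("L")  # "L" = 8-bit grayscale
    w, h = img.size
    # LANCZOS resampling keeps glyph edges reasonably sharp when enlarging.
    return img.resize((w * scale, h * scale), Image.LANCZOS)
```

The resulting image can then be fed to whichever OCR engine you use.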


2.)

Another option is to use the OCR library from Office. It is a COM DLL. It is available in Office 2003, 2007 and Vista, but has been removed in Office 2010.

http://www.codeproject.com/Articles/10130/OCR-with-Microsoft-Office

The disadvantage is that every Office installation comes with support for only a few languages. For example, a Spanish Office installs support for Spanish, English, Portuguese and French. But I noticed that it makes nearly no difference whether you use Spanish or English as the OCR language to detect a Spanish text.

If you convert the image to greyscale you get better results. The recognition is OK, but it did not satisfy me. It makes approximately as many errors as Tesseract, although Tesseract needs much more image preprocessing to get these results.

Elmue

Try http://www.free-ocr.com/. I have used it to recover source code from a screen grab when my IDE crashes in an editor session without warning. It obviously depends on the font you are using in the editor (I use Courier New 10pt in Delphi). I tried to use Google Docs, which will OCR an image when you upload it - while Google Docs is pretty good on scanned documents, it fails miserably on Pascal source for some reason.

An example of FreeOCR at work. The input image [screenshot of the Delphi source] gave this:

begin
FileIDToDelete := FolderToClean + 5earchRecord.Name ;
Inc (TotalFilesFound) ;
if (DeleteFile (PChar (FileIDToDelete))) then
begin
Log5tartupError (FormatEx (‘%s file %s deleted‘, [Annotation, Fi eIDToDelete])) ;
Inc (TotalFilesDeleted) ;
end
else
begin
Log5tartupError (FormatEx (‘Error deleting %s file %s‘, [Annotat'on, FileIDToDelete])) ;
Inc (TotalFilesDeleteErrors) ;
end ;
end ;
FindResult := 5ysUtils.FindNext (5earchRecord) ;
end ;

so replacing the indentation is the bulk of the work, then changing all 5's to upper case S. It also got confused by the vertical line at the 80 column mark. Luckily most errors will be picked up by the compiler (with the exception of mistakes inside quoted strings).
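The cleanup described here (5's misread as S's, curly quotes, and similar) is mechanical enough to script. A minimal sketch in Python; the substitution table is an assumption drawn from the errors visible in this particular listing, not a general-purpose table:

```python
# Hypothetical post-OCR cleanup for the kinds of errors shown above.
# A real table would be tuned per font and scanner.
OCR_FIXES = {
    "5earch": "Search",            # leading "S" misread as "5"
    "Log5tartup": "LogStartup",
    "5ysUtils": "SysUtils",
    "\u2018": "'",                 # curly quotes -> plain quotes
    "\u2019": "'",
}


def clean_ocr_text(text):
    """Apply the fixed substitution table to OCR output."""
    for bad, good in OCR_FIXES.items():
        text = text.replace(bad, good)
    return text
```

Anything the table misses (like the misread letters inside quoted strings) still needs a manual pass, since the compiler won't catch those.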

It's a shame FreeOCR doesn't have a "source code" option, where white space is treated as significant.

A tip: If your source includes syntax highlighting, make sure you save the image as grayscale before uploading.

rossmcm

Printed text is usually easier for OCR than handwritten text; however, it all depends on your source image. I generally find that capturing in PNG format with reduced colors (grayscale is best), plus some manual cleanup (removing any image noise due to scanning, etc.), works best.

Most OCR engines are similar in performance and accuracy. OCR engines with the ability to train/correct would be best.

Darknight

In general I have found that FineReader gives very good results. Normally all products have a trial available; try as many as you can.

Now, program source code can be tricky:

  • leading whitespace: maybe a post-OCR pretty-printer pass can help
  • underscores and punctuation: maybe a good product can be trained for those
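The pretty-printer idea above can be sketched for the begin/end-style source seen elsewhere in this thread. This is a naive reindenter, not a real parser; one indent level per begin block is an assumption:

```python
def reindent(source, indent="  "):
    """Naively restore indentation for begin/end-style code whose
    leading whitespace was lost by OCR. Rough sketch: it only tracks
    bare 'begin' lines and lines starting with 'end'."""
    out, level = [], 0
    for raw in source.splitlines():
        line = raw.strip()
        word = line.rstrip(";").strip().lower()
        if word.startswith("end"):
            level = max(level - 1, 0)  # close a block before printing
        out.append(indent * level + line)
        if word == "begin":
            level += 1                 # open a block after printing
    return "\n".join(out)
```

For C-style braces or Python the heuristics would differ, but the principle is the same: recompute the whitespace instead of trying to OCR it.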
roundcrisis
PeterMmm

OCRopus is also a good open source option. But like Tesseract, there's a rather steep learning curve to use and integrate it effectively.

clartaq

Google Drive's built-in OCR worked pretty well for me. Just convert scans to a PDF, upload to Google Drive, and choose "Open with... Google Docs". There are some weird things with color and text size, but it still includes semicolons and such.

The original screenshot and the Google Docs OCR result [images].

Plaintext version:

#include <stdio.h> int main(void) { 
char word[51]; int contains = -1; int i = 0; int length = 0; scanf("%s", word); while (word[length] != "\0") i ++; while ((contains == 1 || contains == 2) && word[i] != "\0") { 
if (word[i] == "t" || word[i] == "T") { 
if (i <= length / 2) { 
contains = 1; } else contains = 2; 
return 0; 
FuturrCoder