I'm working on a project that entails photographing text (from any hard copy of text) and converting that text into a text file. Then I'd like to use that text file to do some different things, such as provide hyperlinks to news articles or allow the user to edit the document.

The tool I've tried so far is Java OCR from sourceforge.net, which works fine on the images provided in the package. But when I photograph my own text, it doesn't work at all. Is there some training process I should be implementing? If so, does anybody know how to implement it? Any help will go a long way. Thank you!

Agent Pants
  • Just came across this project. No idea if it's any good. http://sourceforge.net/projects/tcrneuroph/ – Stewart Dec 05 '12 at 16:24
  • Interesting. I've come a long way since this question, and ended up downloading VirtualBox and running GOCR on it. But the virtual machine has a world of problems on its own! Lord have mercy. – Agent Pants Dec 10 '12 at 00:51

1 Answer

I have a Java application where I ended up deciding to use Tesseract OCR and just call out to it using Runtime.exec(). Perhaps not quite the answer you need, but just in case you hadn't considered it.


Edit + code added in response to comment reply

  • On a Windows installation I think I was able to use an installer, or unzip a ready made binary.
  • On a Linux server, I needed to compile Tesseract myself, but it's not too hard if you're used to that kind of thing (gcc); the only gotcha is that there's a dependency on Leptonica which also needs to be compiled.

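    // Tesseract (at least the version used here) only handles .tif input, so convert the image first.
    // Note: ImageIO.write returns false if no TIFF writer is registered; JREs before Java 9
    // need the JAI Image I/O plugin for this.
    ImageIO.write(ImageIO.read(new java.io.File(file.getPath())), "tif", tmpFile[0]);

    // Tesseract appends ".txt" to the output base name itself, so strip it from the target path.
    String[] tesseractCmd = new String[]{"tesseract", tmpFile[0].getAbsolutePath(), StringUtils.removeEnd(tmpFile[1].getAbsolutePath(), ".txt")};
    final Process process = Runtime.getRuntime().exec(tesseractCmd);
    try {
        // Block until Tesseract finishes; an exit value of 0 signals success.
        int exitValue = process.waitFor();
        if (exitValue == 0) {
            return SearchableTextExtractionUtils.extractPlainText(new FileReader(tmpFile[1]));
        }
        throw new SearchableTextExtractionException(exitValue, Arrays.toString(tesseractCmd));
    } catch (InterruptedException e) {
        // Restore the interrupt flag before reporting the failure.
        Thread.currentThread().interrupt();
        throw new SearchableTextExtractionException(e);
    } finally {
        process.destroy();
    }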
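
For a more general pattern for calling out to an external program, here is a minimal, self-contained sketch. It is not the code from my application: it uses ProcessBuilder instead of Runtime.exec() so that stderr can be merged into stdout, and it reads the output before waiting, because a child process can block if it fills the pipe buffer and nobody reads it. The names CommandRunner and Result are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CommandRunner {

    /** Exit code plus everything the process wrote to stdout and stderr. */
    public static final class Result {
        public final int exitCode;
        public final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    /** Runs the command, captures its combined output, and waits for it to exit. */
    public static Result run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains both streams.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // Drain the output first; readLine() returns null once the process closes its streams.
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
            int exitCode = process.waitFor();
            return new Result(exitCode, output.toString());
        } finally {
            process.destroy();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java CommandRunner <command> [args...]");
            return;
        }
        Result result = run(Arrays.asList(args));
        System.out.println("exit code: " + result.exitCode);
        System.out.print(result.output);
    }
}
```

Assuming tesseract is on the PATH and page.tif is a placeholder for your own image, `java CommandRunner tesseract page.tif page` should leave the recognised text in page.txt, since Tesseract adds the .txt extension to the output base name itself.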
    
Stewart
  • Thanks for the input. Tesseract is proving to be extremely difficult to set up. Would you be able to enlighten me on how you set it up? Also, a resource on how to properly implement Runtime.exec() would be great. Thanks again for your help. – Agent Pants Nov 13 '12 at 18:40
  • This looks great. I was able to get something working from the command line using Runtime.exec(), but Tesseract still isn't installing. I think that's because my Mac OS is outdated (version 10.5.8) and doesn't have certain Linux commands like "make" and "sudo apt-get". I can't download Xcode to get those commands because it's only available for 10.6 and later, and I can't install gcc for the same reason. Do you know of a simpler OCR engine that works on 10.5.8 by chance? Thanks again for your help. If you don't know of any, I'll try a different machine. – Agent Pants Nov 19 '12 at 13:23
  • http://stackoverflow.com/questions/4360110/installing-gcc-to-mac-os-x-leopard-without-installing-xcode – Stewart Nov 21 '12 at 01:10