I am looking for example code or an API for OCR (optical character recognition) in Java that can extract all the text present in an image file, without comparing it against reference images, which is what I am doing with the code below.

public class OCRTest {

    static String STR = "";

    public static void main(String[] args) {
        OCR l = new OCR(0.70f);
        l.loadFontsDirectory(OCRTest.class, new File("fonts"));
        l.loadFont(OCRTest.class, new File("fonts", "font_1"));
        ImageBinaryGrey i = new ImageBinaryGrey(Capture.load(OCRTest.class, "full.png"));
        STR = l.recognize(i, 1285, 654, 1343, 677, "font_1");
        System.out.println(STR);
    }
}
MC Emperor

3 Answers

You can try Tess4J or the JavaCPP Presets for Tesseract. I prefer the latter, as it is easier to set up than the former. Add the dependency to your pom:

        <dependency>
            <groupId>org.bytedeco.javacpp-presets</groupId>
            <artifactId>tesseract-platform</artifactId>
            <version>3.04.01-1.3</version>
        </dependency>

And it is simple to code:

import org.bytedeco.javacpp.*;
import static org.bytedeco.javacpp.lept.*;
import static org.bytedeco.javacpp.tesseract.*;

public class BasicExample {
    public static void main(String[] args) {
        BytePointer outText;

        TessBaseAPI api = new TessBaseAPI();
        // Initialize tesseract-ocr with English, without specifying tessdata path
        if (api.Init(null, "eng") != 0) {
            System.err.println("Could not initialize tesseract.");
            System.exit(1);
        }

        // Open input image with leptonica library
        PIX image = pixRead(args.length > 0 ? args[0] : "/usr/src/tesseract/testing/phototest.tif");
        api.SetImage(image);
        // Get OCR result
        outText = api.GetUTF8Text();
        System.out.println("OCR output:\n" + outText.getString());

        // Destroy used object and release memory
        api.End();
        outText.deallocate();
        pixDestroy(image);
    }
}

Tess4J is a little more complex, as it requires a specific VC++ redistributable package to be installed.

nav3916872
  • Is there a jar file for this API? – Walid Bousseta Jul 19 '18 at 18:49
  • Yes @WalidBousseta. Both are hosted in the Maven Central repository. If you don't use Maven to build your project, you can download the jars from there: [Tess4J](http://mvnrepository.com/artifact/net.sourceforge.tess4j) and [JavaCPP Presets](https://mvnrepository.com/artifact/org.bytedeco.javacpp-presets/tesseract-platform). Make sure you download the other dependent jars as well. With Maven, all the dependencies are downloaded for you automatically. – nav3916872 Jul 20 '18 at 04:52
You can try JavaOCR on SourceForge: http://javaocr.sourceforge.net/

There is also a great example with an applet which uses Encog: http://www.heatonresearch.com/articles/42/page1.html

That said, OCR requires a lot of processing power, so if you are planning heavy use, you should look at OCR libraries written in C and integrate them with Java.

OCR is hard, so be sure to pin down your requirements before venturing into it.

Tesseract and OpenCV (with JavaCV for the Java integration, for instance) are common choices. There are also commercial solutions such as ABBYY FineReader Engine and ABBYY Cloud OCR SDK.

Community
zenbeni
An open-source OCR engine is available from Google. It can be run from the command line (CMD), and you can easily invoke that command from Java, for example in a web application.
Please visit https://www.youtube.com/watch?v=Mjg4yyuqr5E for step-by-step details on running OCR from CMD.
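
Running a command-line OCR tool from Java could look roughly like the sketch below. It assumes the engine is the `tesseract` executable and that it is on the PATH; the answer does not name the tool, so treat the executable name and arguments as assumptions. The class name `CommandLineOcr` and its helper methods are made up for illustration, and `full.png` is just the file name from the question. It needs Java 9 or later for `readAllBytes`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class CommandLineOcr {

    // Builds the argument list: <executable> <image> stdout -l <language>.
    // Assumption: the tool follows the tesseract CLI convention, where "stdout"
    // as the output base prints the recognized text instead of writing a file.
    static List<String> buildCommand(String executable, Path image, String language) {
        List<String> command = new ArrayList<>();
        command.add(executable);
        command.add(image.toString());
        command.add("stdout");
        command.add("-l");
        command.add(language);
        return command;
    }

    // Starts the process, collects everything it writes to standard output,
    // and fails if the process exits with a non-zero code.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let the tool's diagnostics go to our own stderr so they do not mix with the text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR command exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws InterruptedException {
        Path image = Paths.get(args.length > 0 ? args[0] : "full.png");
        try {
            System.out.println(run(buildCommand("tesseract", image, "eng")));
        } catch (IOException e) {
            // Typically means the executable is not installed or not on the PATH.
            System.err.println("Could not run OCR command: " + e.getMessage());
        }
    }
}
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. In a web application you would probably also want a timeout on `waitFor` so that a stuck process cannot block a request indefinitely.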

Jinu Jawad