Java 8, Tess4J: Optimize image for OCR with Tesseract

Question

I am working with Tesseract and already have OCR functionality working. I want to optimize the image so that the OCR results improve. Currently I only convert the image to monochrome and scale it to double its size. Even so, I still have problems with smaller fonts.

I searched for this, and here is one of the top answers I could find. Unfortunately, it works with Bitmap, and I cannot find any class in the JDK that works with Bitmap. There is also an answer with Java code, but it again uses Bitmap and doesn't specify which package it comes from.

Where does BitmapImageUtil.convertToGrayscale() come from?

Code:

private String testOcr(String fileLocation, int attachId) {
    try {
        File imageFile = new File(fileLocation);
        BufferedImage img = ImageIO.read(imageFile);
        // Random identifier for the intermediate file name
        String identifier = new BigInteger(130, random).toString(32);
        String blackAndWhiteImage = previewPath + identifier + ".png";
        File outputfile = new File(blackAndWhiteImage);
        // Convert to grayscale, then scale to double the original size
        BufferedImage bufferedImage = BitmapImageUtil.convertToGrayscale(img, new Dimension(img.getWidth(), img.getHeight()));
        bufferedImage = Scalr.resize(bufferedImage, img.getWidth() * 2, img.getHeight() * 2);
        ImageIO.write(bufferedImage, "png", outputfile);

        ITesseract instance = Tesseract.getInstance();
        // Point to one folder above tessdata directory, must contain training data
        instance.setDatapath("/usr/share/tesseract-ocr/");
        // ISO 639-3 language code
        instance.setLanguage("deu");
        String result = instance.doOCR(outputfile);
        // result processing with regex.
        return result;
    } catch (IOException | TesseractException e) {
        // Error handling was not shown in the original post; rethrowing is a placeholder
        throw new IllegalStateException("OCR failed for " + fileLocation, e);
    }
}

Is there a general location in the images where you can expect the text to be, or can it show up anywhere? — CraigR8806, Aug 18 '17 at 10:56
@CraigR8806: They can be anywhere in the image. Thank you. — We are Borg, Aug 18 '17 at 11:00
This may or may not be helpful, but with the `Image` class built into Java you have a bit more control over how the image is scaled: https://docs.oracle.com/javase/7/docs/api/java/awt/Image.html. If you use `getScaledInstance()`, the last parameter accepts one of the scaling-hint constants defined by the class (for example `Image.SCALE_SMOOTH`). You may be able to scale your image larger and retain clarity with a different scaling algorithm. — CraigR8806, Aug 18 '17 at 11:06
As suggested in the mentioned post, you need to scale the image to 300 DPI or 12 pt text size. You can then feed the processed `BufferedImage` object to the `doOCR` method directly without having to write it out to an intermediate file (eliminating the I/O operations). — nguyenq, Aug 19 '17 at 13:48
@nguyenq: I will try this out, but currently Tess4J is causing a JVM crash. I am checking that out. — We are Borg, Aug 23 '17 at 09:54
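
The scaling advice in the comments above can be sketched with JDK classes only. This is a minimal illustration, not code from the post or from Tess4J: the class `OcrScaleSketch` and its methods are hypothetical names, the source resolution has to be known or assumed by the caller, and bicubic interpolation is just one reasonable choice of scaling algorithm.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not part of Tess4J or imgscalr.
final class OcrScaleSketch {

    // Scale factor that brings an image from sourceDpi to targetDpi.
    static double factorForDpi(double sourceDpi, double targetDpi) {
        if (sourceDpi <= 0 || targetDpi <= 0) {
            throw new IllegalArgumentException("DPI values must be positive");
        }
        return targetDpi / sourceDpi;
    }

    // Resamples src by the given factor using bicubic interpolation.
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // White background so transparent areas do not turn black
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, w, h);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

For instance, `OcrScaleSketch.scale(img, OcrScaleSketch.factorForDpi(150, 300))` doubles the image size; the 150 DPI source value is an assumption made for the example. The returned `BufferedImage` could then be handed to `doOCR` directly, as nguyenq suggests, instead of being written to an intermediate file.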

score 0 · Answer 1 · answered Feb 12 '18 at 12:56

BitmapImageUtil is from the Apache FOP project ("FOP" = "Formatting Objects Processor").

The package is org.apache.fop.util.bitmap.

Source code for release 2.2 is available here
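
If adding Apache FOP as a dependency only for this one call is undesirable, a grayscale conversion can also be sketched with the JDK alone. The following is not FOP's implementation and may not produce pixel-identical output; `GrayscaleSketch` and `toGrayscale` are hypothetical names used for illustration.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical JDK-only stand-in; not the Apache FOP code.
final class GrayscaleSketch {

    // Draws src into an 8-bit grayscale image of the same size.
    static BufferedImage toGrayscale(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // White background so transparent pixels do not end up black
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, gray.getWidth(), gray.getHeight());
            // Java 2D converts colors to gray while drawing
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

In the question's code, `GrayscaleSketch.toGrayscale(img)` would take the place of the `BitmapImageUtil.convertToGrayscale(...)` call, assuming the output size should match the input size.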

answered Feb 12 '18 at 12:56

Stewart
