I'm building a Java app that scans receipts and extracts all the text using OCR with the Tesseract library. I've run the program on two images, one that I took myself and one from the internet. I get an almost perfect result with the one from the internet, but only random strings from my own image. How can I fix that? Do I need a perfect-quality, high-resolution image?

I've tried taking better images, even images with just a single word, and I'm still not getting anything.

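// Imports assume the Tess4J wrapper, whose API matches the calls below
import java.io.File;
import net.sourceforge.tess4j.Tesseract;

Tesseract instance = new Tesseract();
instance.setDatapath(pathToMyTessData); // folder that contains fra.traineddata
instance.setLanguage("fra");

// doOCR throws TesseractException, so the enclosing method must handle or declare it
String result = instance.doOCR(new File(myReceiptFile));
System.out.println(result);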

The receipt I'm trying to scan contains a lot of information that is useless to me and that I don't want to extract. Is there any way to extract only the food items, date, total, etc.?

P.S.: My receipt looks like this:

rlasvenes

2 Answers

Maybe you should train Tesseract on your own data; there is another post about this here.

  • Training only makes sense for special non-standard fonts or for characters missing from the training data, which is not the case for the above-mentioned image. – user898678 Oct 03 '19 at 06:11
You probably missed this SO topic: "image processing to improve tesseract OCR accuracy".
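As a rough illustration of the kind of preprocessing that topic is about, here is a minimal sketch using only the JDK's `java.awt` classes. It upscales the photo, converts it to grayscale, and binarizes it with a fixed global threshold. The class name `ReceiptPreprocessor`, the scale factor, and the threshold are all assumptions made for this example; a real phone photo may well need an adaptive threshold, deskewing, or cropping instead.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper, not part of Tesseract or any wrapper library.
final class ReceiptPreprocessor {

    /** Upscales the image, converts it to grayscale, then applies a fixed threshold. */
    static BufferedImage preprocess(BufferedImage src, double scale, int threshold) {
        int w = (int) Math.round(src.getWidth() * scale);
        int h = (int) Math.round(src.getHeight() * scale);

        // Drawing into a TYPE_BYTE_GRAY image rescales and converts to grayscale in one step.
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();

        // Global threshold: pixels darker than the threshold become black, the rest white.
        BufferedImage bw = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster in = gray.getRaster();
        WritableRaster out = bw.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int lum = in.getSample(x, y, 0);
                out.setSample(x, y, 0, lum < threshold ? 0 : 255);
            }
        }
        return bw;
    }
}
```

The returned `BufferedImage` can then be handed to the OCR call in place of the original file, for example `instance.doOCR(ReceiptPreprocessor.preprocess(ImageIO.read(new File(myReceiptFile)), 2.0, 128))` if your wrapper accepts a `BufferedImage` (Tess4J does).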

If you want perfect results, you may need to do custom layout analysis so that you can send Tesseract consistent text areas (i.e. regions in which the font size is the same).
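One simple form of layout analysis is a horizontal projection profile: scan the image row by row and group consecutive rows that contain dark pixels into bands, so that each band can be OCR'd on its own. The sketch below is only an assumption about how this could be done in plain Java; `LineSegmenter`, `findTextBands`, and the darkness threshold are hypothetical names and values, and it expects an image that is already reasonably clean and not skewed.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an image into horizontal bands that contain "ink".
final class LineSegmenter {

    /** Returns one rectangle per run of consecutive rows with at least one dark pixel. */
    static List<Rectangle> findTextBands(BufferedImage img, int darkThreshold) {
        List<Rectangle> bands = new ArrayList<>();
        int start = -1; // first row of the band currently being collected, or -1 if none
        for (int y = 0; y < img.getHeight(); y++) {
            boolean hasInk = rowHasInk(img, y, darkThreshold);
            if (hasInk && start < 0) {
                start = y;
            } else if (!hasInk && start >= 0) {
                bands.add(new Rectangle(0, start, img.getWidth(), y - start));
                start = -1;
            }
        }
        if (start >= 0) { // the last band touches the bottom edge
            bands.add(new Rectangle(0, start, img.getWidth(), img.getHeight() - start));
        }
        return bands;
    }

    /** A row has ink if any pixel's average RGB value is below the threshold. */
    private static boolean rowHasInk(BufferedImage img, int y, int darkThreshold) {
        for (int x = 0; x < img.getWidth(); x++) {
            int rgb = img.getRGB(x, y);
            int lum = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
            if (lum < darkThreshold) {
                return true;
            }
        }
        return false;
    }
}
```

Each returned `Rectangle` can be passed to an OCR call that accepts a region (Tess4J has a `doOCR(BufferedImage, Rectangle)` overload), or used with `BufferedImage.getSubimage` to rescale bands to a similar text height first. Bands of very different heights are a hint that the receipt mixes font sizes.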

user898678