
I'm trying to figure out how to get the coordinates and bounding rectangle of each word in a text image after tess4j performs the OCR. I'm quite the beginner, so can somebody please break it down for me? Much appreciated.

Koushik Ravikumar

2 Answers


I'm rather new to tess4j myself and I'd hate to disagree with @nguyenq, but here's how I did it:

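import java.awt.Image;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;

String imageUrl = "...";  // path to the image file on disk
File imageFile = new File(imageUrl);
Image image = ImageIO.read(imageFile);
BufferedImage bi = toBufferedImage(image);
ITesseract instance = new Tesseract();

// RIL_TEXTLINE yields one box per text line; pass RIL_WORD for one box per word
for (Word word : instance.getWords(bi, ITessAPI.TessPageIteratorLevel.RIL_TEXTLINE)) {
  Rectangle rect = word.getBoundingBox();

  // Prints minX,maxX,minY,maxY followed by the recognized text
  System.out.println(rect.getMinX() + "," + rect.getMaxX() + "," + rect.getMinY() + "," + rect.getMaxY()
                    + ": " + word.getText());
}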

And here's my toBufferedImage method:

public static BufferedImage toBufferedImage(Image img)
{
  if (img instanceof BufferedImage)
  {
      return (BufferedImage) img;
  }

  // Create a buffered image with transparency
  BufferedImage bimage = new BufferedImage(img.getWidth(null), img.getHeight(null), BufferedImage.TYPE_INT_ARGB);

  // Draw the image on to the buffered image
  Graphics2D bGr = bimage.createGraphics();
  bGr.drawImage(img, 0, 0, null);
  bGr.dispose();

  // Return the buffered image
  return bimage;
}

SO credit

Edit: I should note that this uses tess4j v3.3.1. This convenience API must have been added by @nguyenq after the question was originally posted.
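For anyone unsure how to read the numbers the loop in this answer prints: the bounding box is a java.awt.Rectangle, which stores the top-left corner (x, y) plus a width and height, with the origin at the top-left of the image and y growing downward. The sketch below uses only the JDK, not tess4j; the class name BoundingBoxDemo, the describe helper, and the sample box are made up for illustration and stand in for whatever word.getBoundingBox() returns.

```java
import java.awt.Rectangle;

public class BoundingBoxDemo {
    // Formats a box as "left,top,right,bottom: text" in integer pixel coordinates.
    static String describe(String text, Rectangle box) {
        int left = box.x;
        int top = box.y;
        int right = box.x + box.width;    // same value as (int) box.getMaxX()
        int bottom = box.y + box.height;  // same value as (int) box.getMaxY()
        return left + "," + top + "," + right + "," + bottom + ": " + text;
    }

    public static void main(String[] args) {
        // Hypothetical box standing in for word.getBoundingBox()
        Rectangle box = new Rectangle(36, 92, 60, 18);
        System.out.println(describe("hello", box));  // prints 36,92,96,110: hello
    }
}
```

Note that getMinX() and friends return double, so the loop in the answer prints values like 36.0, while the public x, y, width and height fields are plain ints.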

kane

Tess4J's unit tests include examples for obtaining bounding boxes for recognized words. The code is similar to that in "Tess4J: How to use ResultIterator?".

nguyenq
  • Thank you very much. Can I, by any chance, get an entire example? Just a very simple one. (And can I say, I am very much star-struck by you, Quan Nguyen.) – Koushik Ravikumar Mar 21 '13 at 17:46
  • The unit tests can be found in the project's code repository: http://sourceforge.net/p/tess4j/code/181/tree/Tess4J_3/trunk/test/net/sourceforge/tess4j/ – nguyenq Mar 25 '13 at 04:06
  • The test case testResultIterator represents a complete example for determining the bounding boxes. The code is rather straightforward; you should be able to follow it. – nguyenq Mar 25 '13 at 04:24
  • I tried executing the testResultIterator code and got the following error: – Koushik Ravikumar Mar 28 '13 at 15:09
  • # Problematic frame: # C [libtesseract302.dll+0xf834] tesseract::TessBaseAPI::Init+0x34 # # Failed to write core dump. Minidumps are not enabled by default on client versions of Windows. I am using Eclipse to build my project. Is there a patch that can fix the driver file, which seems to be the problem here? – Koushik Ravikumar Mar 28 '13 at 15:10
  • It's possible that your files are corrupted. Try downloading the [distribution](http://sourceforge.net/projects/tess4j/files/tess4j/1.1/) again. – nguyenq Mar 28 '13 at 23:07
  • Done. Still the same error. But the basic example code runs with no errors. – Koushik Ravikumar Mar 29 '13 at 17:28
  • Not all of them. The testTessBaseAPIGetUTF8Text and the testResultIterator fail with the same error. The rest of the test cases in that particular class run without errors. – Koushik Ravikumar Apr 01 '13 at 05:38
  • I think I know what the problem is. I have to implement the TessBaseAPI methods, don't I? The tessBaseAPIImpl class in the unit tests is left unimplemented. – Koushik Ravikumar Apr 01 '13 at 06:42
  • All the methods in the tessBaseAPI need to be implemented by me, which is why all the unit tests that use the TessBaseAPI handle are failing. – Koushik Ravikumar Apr 01 '13 at 06:51
  • How can I get the source code for the DLL file? Having the C++ code to read would help me create the wrapper... – Koushik Ravikumar Apr 02 '13 at 05:42