
I am having problems with the recognition of subscript and superscript in text fragments.

Example image:

[example image with subscript and superscript text]

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All options were left at their default values except:

  • tessedit_create_hocr = 1 (to get the result as HOCR)
  • hocr_font_info = 1 (to get additional font info such as the font size)
  • hocr_char_boxes = 1 (to get a character-based result)

The language was set to eng. Neither page segmentation mode 3 (PSM_AUTO_OSD), nor 11 (PSM_SPARSE_TEXT), nor 12 (PSM_SPARSE_TEXT_OSD) recognized the subscript/superscript correctly.
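For illustration, here is one way these options and the page segmentation mode can be set programmatically. This is only a sketch using Tess4J (the same library used in the answer below); the class name and the tessdata path are placeholders.

// Sketch only: sets the non-default options from above via Tess4J; paths are placeholders.
import java.io.File;

import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.TessAPI1;

public class HocrConfigExample {
    public static void main(String[] args) {
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            // language "eng", default engine; "./tessdata" is a placeholder path
            TessAPI1.TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(),
                    "eng", ITessAPI.TessOcrEngineMode.OEM_DEFAULT);

            // the three non-default options from the question
            TessAPI1.TessBaseAPISetVariable(handle, "tessedit_create_hocr", "1"); // HOCR output
            TessAPI1.TessBaseAPISetVariable(handle, "hocr_font_info", "1");       // font info (e.g. font size)
            TessAPI1.TessBaseAPISetVariable(handle, "hocr_char_boxes", "1");      // character-based boxes

            // page segmentation mode 3 (PSM_AUTO_OSD); modes 11 and 12 were tried as well
            TessAPI1.TessBaseAPISetPageSegMode(handle, ITessAPI.TessPageSegMode.PSM_AUTO_OSD);
        } finally {
            TessAPI1.TessBaseAPIDelete(handle); // release memory
        }
    }
}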

In the output, the sub/sup fragments were all more or less wrong:

  • "SubtextSub" is recognized as "Subtextsu,"
  • "SuptextSub" is recognized as "Suptexts?"
  • "P0" is recognized as "Po"
  • "P100" is recognized as "P1go"
  • "a2+b2" is recognized as "a+b?"

When using Tesseract for OCR, is there a way to ...?

  1. optimize subscript/superscript handling
  2. get info about the recognized subscript/superscript (in the HOCR output, ideally for each character)
  • To give a little bit of context: Superscripts and subscripts are important when it comes to chemical formulas. Superscripts are also used for footnotes. The distinction from normal text is relevant when the superscript follows a number: `Revenue in Q1 (in million USD): 54²` is very different from `Revenue in Q1 (in million USD): 542`. – Martin Thoma Sep 03 '20 at 08:53

3 Answers


Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.

Following these two links from the Tesseract Google newsgroup, at first it really seemed to be a question of training: link1 and link2.

But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode just doesn't provide the needed information. I found a partial solution to the problem: partial, because I now get most of the sub/sup information and the recognized characters are correct in most cases, but not for all characters.

Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy engine) and some API methods provided by Tess4J, I came up with the following Java test class:

// Imports for Tess4J (net.sourceforge.tess4j) and JNA used by the class below
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.IntBuffer;

import com.sun.jna.Pointer;

import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.ITessAPI.TessOcrEngineMode;
import net.sourceforge.tess4j.ITessAPI.TessPageIterator;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.ITessAPI.TessResultIterator;
import net.sourceforge.tess4j.TessAPI1;
import net.sourceforge.tess4j.util.ImageIOHelper;

import static net.sourceforge.tess4j.TessAPI1.*;

public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                    rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                    TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}

The legacy mode only works with 'normal' training data. Using the '-best' training data produces an error.

  • your answer seems very promising. I have been looking for an answer to this problem. can you share an example of how to run your code? thanks. – mjpablo23 Nov 04 '20 at 22:32
  • I think most information is in the answer. That means you need Java and the Tess4J library (see link). How to create a BufferedImage from an image file can be found in numerous questions here on Stack Overflow; see also the minimal sketch after these comments. – MaS Nov 05 '20 at 10:49
  • ah ok thanks. I am trying to run it on my Mac using Eclipse. I am trying to include the correct log4j and slf4j jar files. But I keep getting this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory at net.sourceforge.tess4j.Tesseract.(Unknown Source) – mjpablo23 Nov 06 '20 at 22:11
  • Log4j has a lot of jars :-) Try to include the one with api in it. – MaS Nov 10 '20 at 07:11
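To make the "how to run it" comments a bit more concrete, here is a minimal sketch of how the class from the answer could be invoked. It assumes Tess4J and its dependencies are on the classpath, that ./tessdata contains the standard (non-best) eng traineddata, and that the image path is a placeholder.

// Minimal, assumed usage sketch for the SubSupEvaluator class above
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class SubSupEvaluatorDemo {
    public static void main(String[] args) throws Exception {
        // load the input image as a BufferedImage (path is a placeholder)
        BufferedImage image = ImageIO.read(new File("example.png"));
        // run the character-wise evaluation; results are printed to the console
        new SubSupEvaluator().determineSubSupCharacters(image);
    }
}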

There is very little information on this topic. One option to enhance sub/superscript character recognition (even if not the position itself) is to preprocess the image, e.g. with cv2 / PIL (Pillow), and then run Tesseract on it.

See How to detect subscript numbers in an image using OCR?
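The linked question uses cv2/PIL; since the rest of this thread is Java/Tess4J, here is a rough plain-Java equivalent of the same idea (upscale, then binarize, then feed the result to Tesseract). The scale factor and threshold below are arbitrary assumptions, not tuned values.

// Rough preprocessing sketch: upscale and binarize before OCR; values are arbitrary assumptions
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class PreprocessExample {
    public static void main(String[] args) throws Exception {
        BufferedImage src = ImageIO.read(new File("example.png")); // placeholder input

        // 1. Upscale: small sub/superscript glyphs often benefit from more pixels
        int scale = 3; // arbitrary assumption
        BufferedImage scaled = new BufferedImage(src.getWidth() * scale, src.getHeight() * scale,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, scaled.getWidth(), scaled.getHeight(), null);
        g.dispose();

        // 2. Simple global threshold to black/white
        BufferedImage bw = new BufferedImage(scaled.getWidth(), scaled.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < scaled.getHeight(); y++) {
            for (int x = 0; x < scaled.getWidth(); x++) {
                int rgb = scaled.getRGB(x, y);
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                bw.setRGB(x, y, gray < 160 ? 0xFF000000 : 0xFFFFFFFF); // 160 is an arbitrary threshold
            }
        }

        // 3. Save the preprocessed image and run Tesseract on it afterwards
        ImageIO.write(bw, "png", new File("preprocessed.png"));
    }
}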

Related (but otherwise not answering the question):

https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp


What do you guys think about getting Tesseract to recognize single letters?

Tesseract does not recognize single characters

I tried it with the option --psm 10

tesseract imTstg.png out5 --psm 10

but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

– mjpablo23