
I am having problems with the recognition of subscript and superscript in text fragments.

Example image:

[example image with subscript and superscript text]

I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. All options were left at their default values except:

  • tessedit_create_hocr = 1 (to get the result as HOCR)
  • hocr_font_info = 1 (to get additional font info such as the font size)
  • hocr_char_boxes = 1 (to get a character-based result)

The language was set to eng. Neither page segmentation mode 3 (PSM_AUTO_OSD), nor 11 (PSM_SPARSE_TEXT), nor 12 (PSM_SPARSE_TEXT_OSD) recognized the subscript/superscript correctly.
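For illustration, here is one way these options and the page segmentation mode can be set programmatically. This is only a sketch using Tess4J (the same library used in the answer below); the class name and the tessdata path are placeholders.

// Sketch only: sets the non-default options from above via Tess4J; paths are placeholders.
import java.io.File;

import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.TessAPI1;

public class HocrConfigExample {
    public static void main(String[] args) {
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            // language "eng", default engine; "./tessdata" is a placeholder path
            TessAPI1.TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(),
                    "eng", ITessAPI.TessOcrEngineMode.OEM_DEFAULT);

            // the three non-default options from the question
            TessAPI1.TessBaseAPISetVariable(handle, "tessedit_create_hocr", "1"); // HOCR output
            TessAPI1.TessBaseAPISetVariable(handle, "hocr_font_info", "1");       // font info (e.g. font size)
            TessAPI1.TessBaseAPISetVariable(handle, "hocr_char_boxes", "1");      // character-based boxes

            // page segmentation mode 3 (PSM_AUTO_OSD); modes 11 and 12 were tried as well
            TessAPI1.TessBaseAPISetPageSegMode(handle, ITessAPI.TessPageSegMode.PSM_AUTO_OSD);
        } finally {
            TessAPI1.TessBaseAPIDelete(handle); // release memory
        }
    }
}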

In the output, the sub/sup fragments were all more or less wrong:

  • "SubtextSub" is recognized as "Subtextsu,"
  • "SuptextSub" is recognized as "Suptexts?"
  • "P0" is recognized as "Po"
  • "P100" is recognized as "P1go"
  • "a2+b2" is recognized as "a+b?"

When using Tesseract for OCR, is there a way to ...?

  1. optimize subscript/superscript handling
  2. get info about the recognized subscript/superscript (in the HOCR output, ideally for each character)
  • To give a little bit of context: Superscripts and subscripts are important when it comes to chemical formulas. Superscripts are also used for footnotes. The distinction from normal text is relevant when the superscript follows a number: `Revenue in Q1 (in million USD): 54²` is very different from `Revenue in Q1 (in million USD): 542`. – Martin Thoma Sep 03 '20 at 08:53

3 Answers


Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.

Following these two links from the Tesseract Google newsgroup, at first it really seemed to be a question of training: link1 and link2.

But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode just doesn't provide the needed information. I found a partial solution to the problem: partial, because I now get most of the sub/sup information and the recognized characters are correct in most cases, but not for all characters.

Using the OEM_TESSERACT_ONLY OCR engine mode (i.e. the legacy engine) and some API methods provided by Tess4J, I came up with the following Java test class:

// Imports for Tess4J (net.sourceforge.tess4j) and JNA used by the class below
import java.awt.image.BufferedImage;
import java.io.File;
import java.nio.IntBuffer;

import com.sun.jna.Pointer;

import net.sourceforge.tess4j.ITessAPI.TessBaseAPI;
import net.sourceforge.tess4j.ITessAPI.TessOcrEngineMode;
import net.sourceforge.tess4j.ITessAPI.TessPageIterator;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.ITessAPI.TessResultIterator;
import net.sourceforge.tess4j.TessAPI1;
import net.sourceforge.tess4j.util.ImageIOHelper;

import static net.sourceforge.tess4j.TessAPI1.*;

public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                    rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                    TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}

The legacy mode only works with 'normal' training data. Using the '-best' training data produces an error.

  • your answer seems very promising. I have been looking for an answer to this problem. can you share an example of how to run your code? thanks. – mjpablo23 Nov 04 '20 at 22:32
  • I think most information is in the answer. That means you need Java and the Tess4J library (see link). How to create a BufferedImage from an image file can be found in numerous questions here on Stack Overflow; see also the minimal sketch after these comments. – MaS Nov 05 '20 at 10:49
  • ah ok thanks. I am trying to run it on my Mac using Eclipse. I am trying to include the correct log4j and slf4j jar files. But I keep getting this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory at net.sourceforge.tess4j.Tesseract.(Unknown Source) – mjpablo23 Nov 06 '20 at 22:11
  • Log4j has a lot of jars :-) Try to include the one with api in it. – MaS Nov 10 '20 at 07:11
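To make the "how to run it" comments a bit more concrete, here is a minimal sketch of how the class from the answer could be invoked. It assumes Tess4J and its dependencies are on the classpath, that ./tessdata contains the standard (non-best) eng traineddata, and that the image path is a placeholder.

// Minimal, assumed usage sketch for the SubSupEvaluator class above
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class SubSupEvaluatorDemo {
    public static void main(String[] args) throws Exception {
        // load the input image as a BufferedImage (path is a placeholder)
        BufferedImage image = ImageIO.read(new File("example.png"));
        // run the character-wise evaluation; results are printed to the console
        new SubSupEvaluator().determineSubSupCharacters(image);
    }
}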

There is very little information on this topic. One option to enhance sub/superscript character recognition (even if not the position itself) is to preprocess the image, e.g. with cv2 / PIL (Pillow), and then run Tesseract on it.

See How to detect subscript numbers in an image using OCR?
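The linked question uses cv2/PIL; since the rest of this thread is Java/Tess4J, here is a rough plain-Java equivalent of the same idea (upscale, then binarize, then feed the result to Tesseract). The scale factor and threshold below are arbitrary assumptions, not tuned values.

// Rough preprocessing sketch: upscale and binarize before OCR; values are arbitrary assumptions
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class PreprocessExample {
    public static void main(String[] args) throws Exception {
        BufferedImage src = ImageIO.read(new File("example.png")); // placeholder input

        // 1. Upscale: small sub/superscript glyphs often benefit from more pixels
        int scale = 3; // arbitrary assumption
        BufferedImage scaled = new BufferedImage(src.getWidth() * scale, src.getHeight() * scale,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, scaled.getWidth(), scaled.getHeight(), null);
        g.dispose();

        // 2. Simple global threshold to black/white
        BufferedImage bw = new BufferedImage(scaled.getWidth(), scaled.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < scaled.getHeight(); y++) {
            for (int x = 0; x < scaled.getWidth(); x++) {
                int rgb = scaled.getRGB(x, y);
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                bw.setRGB(x, y, gray < 160 ? 0xFF000000 : 0xFFFFFFFF); // 160 is an arbitrary threshold
            }
        }

        // 3. Save the preprocessed image and run Tesseract on it afterwards
        ImageIO.write(bw, "png", new File("preprocessed.png"));
    }
}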

Related (but otherwise not answering the question):

https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html

https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp


What do you guys think about getting Tesseract to recognize single letters?

Tesseract does not recognize single characters

I tried it with the option --psm 10

tesseract imTstg.png out5 --psm 10

but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

– mjpablo23