How to detect text blocks and columns in pdf with tess4j

Question

I'm new to Tesseract (Tess4J). I've managed to use the main features, such as reading text and getting word positions from both images and PDFs, rotating, etc.

I can't find a way, and I'm not sure whether it is possible, to easily detect blocks of text (paragraphs or columns). Also, if the PDF contains other kinds of blocks, such as images, is it possible to extract them, or at least to get the position (bounding box) of each block?

It can be any PDF. I need to detect whether it contains images and, if so, their positions. — Djordje Ivanovic, Feb 23 '17 at 15:01
If the PDF is something like an advertisement flyer, Tesseract won't be able to meet your requirement. There's a trade-off in neural networks between generality and accuracy. What you can do is choose the text blocks manually, or write a piece of code if your PDFs follow some pattern. — Top.Deck, Feb 23 '17 at 15:08
It could be a book, for example: a lot of text with an image here and there. Is that possible, or is it the same as for the flyer? — Djordje Ivanovic, Feb 23 '17 at 15:22
OpenCV is probably the library you are looking for to detect text blocks. This [post](http://stackoverflow.com/questions/23506105/extracting-text-opencv) may help. — Top.Deck, Feb 23 '17 at 15:30
OpenCV is not an option for now, but if I fail to find a way with Tesseract, I will check that as well. Thanks for your time! — Djordje Ivanovic, Feb 23 '17 at 15:37

score 2 · Accepted Answer · answered Feb 25 '17 at 15:57

You can use the `TessBaseAPIGetComponentImages` API method, as follows:

Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, TRUE, null, null);

Check the Tess4J unit tests for complete examples.

nguyenq


Ah, you saved me a lot of time! Nice one. It seems that this is what I need; I will play with it a bit. Thanks! (I will accept the answer soon.) — Djordje Ivanovic, Feb 25 '17 at 20:19
If I set `TessPageIteratorLevel.RIL_BLOCK`, it always returns only one box, even if the page has several text blocks. For `RIL_TEXTLINE` it returns the correct lines. I also tried `RIL_PARA`, with the same result: only one box. Any idea how to improve this? — Djordje Ivanovic, Feb 27 '17 at 07:40
OK, I fixed it by adding `api.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);`. Can you tell me what the parameter `text_only` means? If it is set to false, will it return blocks that contain images? — Djordje Ivanovic, Feb 27 '17 at 10:44
If I set it to false, it recognizes the image as a box, but I'm not sure how to get the image from the box. — Djordje Ivanovic, Feb 27 '17 at 13:47
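On the last comment's open question (getting the image out of a box), one option is to crop the box rectangle out of the page image on the Java side. The following is only a sketch under assumptions: `BlockCropper` and `crop` are hypothetical names, not part of Tess4J, and it assumes the box coordinates are pixel coordinates in the same image that was handed to Tesseract, with the page also loaded as a `BufferedImage`.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of Tess4J): cuts a detected block out of the page image. */
final class BlockCropper {
    private BlockCropper() {}

    /**
     * Returns an independent copy of the pixels inside the given box.
     * The box is clamped to the page bounds first; null is returned when
     * the box does not overlap the page at all.
     */
    static BufferedImage crop(BufferedImage page, int x, int y, int w, int h) {
        Rectangle pageBounds = new Rectangle(0, 0, page.getWidth(), page.getHeight());
        Rectangle clipped = pageBounds.intersection(new Rectangle(x, y, w, h));
        if (clipped.isEmpty()) {
            return null;
        }
        // Copy the pixels so the result does not share storage with the page image
        int[] pixels = page.getRGB(clipped.x, clipped.y, clipped.width, clipped.height,
                null, 0, clipped.width);
        BufferedImage copy = new BufferedImage(clipped.width, clipped.height,
                BufferedImage.TYPE_INT_ARGB);
        copy.setRGB(0, 0, clipped.width, clipped.height, pixels, 0, clipped.width);
        return copy;
    }
}
```

With a page loaded through `ImageIO.read(...)`, a call such as `BlockCropper.crop(page, box.x, box.y, box.w, box.h)` would then give the pixels of one block, which can be written out with `ImageIO.write`.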

score 1 · Answer 2 · answered Feb 28 '17 at 10:35

I have already accepted the answer above, but here is the code I ended up with, based on it:

// api, handle, datapath, language and log are fields of the enclosing class.
// FALSE and L_CLONE are presumably static imports of ITessAPI.FALSE and ILeptonica.L_CLONE.
public Page recognizeTextBlocks(Path path) {
    log.info("TessBaseAPIGetComponentImages");
    Leptonica leptInstance = Leptonica.INSTANCE;
    Pix pix = leptInstance.pixRead(path.toString());
    Page blocks = new Page(pix.w, pix.h);
    api.TessBaseAPIInit3(handle, datapath, language);
    api.TessBaseAPISetImage2(handle, pix);
    // Needed so that RIL_BLOCK returns more than one box (see the comments on the accepted answer)
    api.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
    // text_only = FALSE so that image regions are returned too; the Pixa and block-id outputs are not used
    Boxa boxes = api.TessBaseAPIGetComponentImages(handle, TessPageIteratorLevel.RIL_BLOCK, FALSE, null, null);
    int boxCount = leptInstance.boxaGetCount(boxes);
    for (int i = 0; i < boxCount; i++) {
        Box box = leptInstance.boxaGetBox(boxes, i, L_CLONE);
        if (box == null) {
            continue;
        }
        // Restrict recognition to the current block
        api.TessBaseAPISetRectangle(handle, box.x, box.y, box.w, box.h);
        Pointer utf8Text = api.TessBaseAPIGetUTF8Text(handle);
        String ocrResult = utf8Text == null ? null : utf8Text.getString(0);
        Rectangle bounds = new Rectangle(box.x, box.y, box.w, box.h);
        Block block;
        // A block with no recognizable text is treated as an image
        if (ocrResult == null || ocrResult.trim().isEmpty()) {
            block = new ImageBlock(bounds);
        } else {
            block = new TextBlock(bounds, ocrResult);
        }
        blocks.add(block);
        if (utf8Text != null) {
            api.TessDeleteText(utf8Text);
        }
        int conf = api.TessBaseAPIMeanTextConf(handle);
        log.debug(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s", i, box.x, box.y, box.w, box.h, conf, ocrResult));
        LeptUtils.dispose(box);
    }

    // Release the native Leptonica resources
    LeptUtils.dispose(boxes);
    LeptUtils.dispose(pix);

    return blocks;
}

Note: The classes `Block`, `ImageBlock` and `TextBlock` are from my project; they are not part of Tess4J or Tesseract.
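Since those classes are not shown, here is a minimal sketch of what they might look like, so that the method above has something to compile against. Everything in it is an assumption inferred from how the classes are used: the field and accessor names are invented for illustration, `Rectangle` is assumed to be `java.awt.Rectangle`, and `Page` (presumably also project-specific) is assumed to be a simple container of blocks.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical base class: a rectangular region detected on a page. */
abstract class Block {
    private final Rectangle bounds;

    Block(Rectangle bounds) {
        this.bounds = bounds;
    }

    Rectangle getBounds() {
        return bounds;
    }
}

/** Hypothetical: a region for which OCR produced some text. */
final class TextBlock extends Block {
    private final String text;

    TextBlock(Rectangle bounds, String text) {
        super(bounds);
        this.text = text;
    }

    String getText() {
        return text;
    }
}

/** Hypothetical: a region with no recognizable text, treated as an image. */
final class ImageBlock extends Block {
    ImageBlock(Rectangle bounds) {
        super(bounds);
    }
}

/** Hypothetical: the page dimensions plus the blocks found on the page, in detection order. */
final class Page {
    private final int width;
    private final int height;
    private final List<Block> blocks = new ArrayList<>();

    Page(int width, int height) {
        this.width = width;
        this.height = height;
    }

    void add(Block block) {
        blocks.add(block);
    }

    List<Block> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }

    int getWidth() {
        return width;
    }

    int getHeight() {
        return height;
    }
}
```

A caller could then walk `page.getBlocks()` and use `instanceof TextBlock` to separate the text regions from the image regions.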

In addition to `Pix`, `Box` and `Boxa` objects would need to be properly disposed of as well, I just noticed. Use `LeptUtils.dispose` method for that purpose. — nguyenq, Mar 02 '17 at 14:06
