
I want to highlight the bounding boxes (BBoxes) of a particular tag when the user selects that tag in the structure tree. I am able to get the BBoxes when the tag contains attributes like this:

structure.

But I found that in some PDFs, even though there are no attributes (/A), Adobe Acrobat DC is still able to highlight the content (BBoxes) when you select a particular tag. How can I get the BBoxes in this case? The code I used to get the attribute-based BBoxes is:

String inputPdfFile = "D:/Documents/pdfs/res.pdf";
PDDocument old_document = PDDocument.load(new File(inputPdfFile));
PDStructureTreeRoot treeRoot = old_document.getDocumentCatalog().getStructureTreeRoot();
for (Object kid : treeRoot.getKids()){
    for (Object kid2 :((PDStructureElement)kid).getKids()){
        PDStructureElement kid2c = (PDStructureElement)kid2;
        for (Object kid3 : kid2c.getKids()){
            if (kid3 instanceof PDStructureElement){
                PDStructureElement kid3c = (PDStructureElement)kid3;
                System.out.println(kid3c.getAttributes());
            }
        }
    }
}

The PDF is available at https://drive.google.com/file/d/1_-tuWuReaTvrDsqQwldTnPYrMHSpXIWp/view?usp=sharing

Could anyone please help?

Tilman Hausherr
  • The elements of the structure tree correspond to specific drawing instructions in the page content (or dependent content streams) via the marked content ID. You essentially _merely_ have to determine the area in which these drawing instructions draw something. This obviously only gives you the _actual_ bounding box, not the _intended_ or _reserved_ box... – mkl Dec 05 '19 at 10:34
  • @mkl Thanks for the reply. In the attached document Adobe is able to get the tag area. How can I get the BBoxes of each tag? Please give me a clue. I will try to code it (applying drawing instructions while tagging and using those instructions while getting the BBoxes) with PDFBox. – fascinating coder Dec 05 '19 at 11:38
  • I suspect that what you need here is to call getCOSObject() on these objects. If you hit a dictionary, you could try to call getItem(COSName.BBox). – Tilman Hausherr Dec 05 '19 at 15:08
  • @TilmanHausherr If I understood the OP correctly, there is no attribute (**A**) object in the case of the documents he now has to deal with. Neither are there class (**C**) names. Thus, if one wants to know layout details, one has to derive them from the actual drawing instructions in the content streams. – mkl Dec 05 '19 at 16:21
  • @mkl Yes, you are correct. I need to derive the positions from the drawing instructions to highlight the content. I am able to do what Tilman suggests, but it won't solve my problem 100%. Thanks, please help. – fascinating coder Dec 06 '19 at 05:18

1 Answer


To determine the actual bounding boxes (in contrast to those given in some Structure Element Layout Attributes), of the text of some marked content, you can use the PDFBox PDFMarkedContentExtractor and combine its results with the PDF Structure Tree data.

The following code does so and creates an output PDF in which the determined bounding boxes are enclosed in colored rectangles:

PDDocument document = PDDocument.load(SOURCE);

Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();

for (PDPage page : document.getPages()) {
    PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
    extractor.processPage(page);

    Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
    markedContents.put(page, theseMarkedContents);
    for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
        addToMap(theseMarkedContents, markedContent);
    }
}

PDStructureNode root = document.getDocumentCatalog().getStructureTreeRoot();
Map<PDPage, PDPageContentStream> visualizations = new HashMap<>();
showStructure(document, root, markedContents, visualizations);
for (PDPageContentStream canvas : visualizations.values())
    canvas.close();

document.save(RESULT);

(from the VisualizeMarkedContent method visualize)

It uses the following helper method for recursively mapping the PDMarkedContent objects by their MCID:

void addToMap(Map<Integer, PDMarkedContent> theseMarkedContents, PDMarkedContent markedContent) {
    theseMarkedContents.put(markedContent.getMCID(), markedContent);
    for (Object object : markedContent.getContents()) {
        if (object instanceof PDMarkedContent) {
            addToMap(theseMarkedContents, (PDMarkedContent)object);
        }
    }
}

(VisualizeMarkedContent helper method)
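
The recursion matters because marked-content sequences can be nested, and only walking the top level would miss the inner MCIDs. As a rough illustration without PDFBox, here is the same flattening idea on a made-up `Node` type; `McidMapSketch` and `Node` are hypothetical stand-ins for illustration only, not PDFBox classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class McidMapSketch {
    /** Hypothetical stand-in for a marked-content sequence: an id plus nested children. */
    static class Node {
        final int id;
        final List<Node> children = new ArrayList<>();

        Node(int id) {
            this.id = id;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Registers the node under its id, then does the same for every nested node. */
    static void addToMap(Map<Integer, Node> byId, Node node) {
        byId.put(node.id, node);
        for (Node child : node.children) {
            addToMap(byId, child);
        }
    }

    public static void main(String[] args) {
        // Three levels of nesting: 0 contains 1, which contains 2.
        Node top = new Node(0).add(new Node(1).add(new Node(2)));
        Map<Integer, Node> byId = new HashMap<>();
        addToMap(byId, top);
        System.out.println(byId.keySet());
    }
}
```

After the call, all three ids can be looked up directly, which is what the structure tree walk needs when it meets an MCID kid.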

The method showStructure recursively determines the bounding boxes of structure elements and draws a rectangle for each element. A structure element can contain content across pages, so we have to work with a mapping of pages to bounding boxes in its boxes variable...

Map<PDPage, Rectangle2D> showStructure(PDDocument document, PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents, Map<PDPage, PDPageContentStream> visualizations) throws IOException {
    Map<PDPage, Rectangle2D> boxes = null;
    PDPage page = null;
    if (node instanceof PDStructureElement) {
        PDStructureElement element = (PDStructureElement) node;
        page = element.getPage();
    }
    Map<Integer, PDMarkedContent> theseMarkedContents = markedContents.get(page);
    for (Object object : node.getKids()) {
        if (object instanceof COSArray) {
            for (COSBase base : (COSArray) object) {
                if (base instanceof COSDictionary) {
                    boxes = union(boxes, showStructure(document, PDStructureNode.create((COSDictionary) base), markedContents, visualizations));
                } else if (base instanceof COSNumber) {
                    boxes = union(boxes, page, showContent(((COSNumber)base).intValue(), theseMarkedContents));
                } else {
                    System.out.printf("?%s\n", base);
                }
            }
        } else if (object instanceof PDStructureNode) {
            boxes = union(boxes, showStructure(document, (PDStructureNode) object, markedContents, visualizations));
        } else if (object instanceof Integer) {
            boxes = union(boxes, page, showContent((Integer)object, theseMarkedContents));
        } else {
            System.out.printf("?%s\n", object);
        }

    }
    if (boxes != null) {
        Color color = new Color((int)(Math.random() * 256), (int)(Math.random() * 256), (int)(Math.random() * 256));

        for (Map.Entry<PDPage, Rectangle2D> entry : boxes.entrySet()) {
            page = entry.getKey();
            Rectangle2D box = entry.getValue();
            if (box == null)
                continue;

            PDPageContentStream canvas = visualizations.get(page);
            if (canvas == null) {
                canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true);
                visualizations.put(page, canvas);
            }
            canvas.saveGraphicsState();
            canvas.setStrokingColor(color);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
            canvas.restoreGraphicsState();
        }
    }
    return boxes;
}

(VisualizeMarkedContent method)

The method showContent determines the bounding box of text associated with a given MCID, recursing if need be.

Rectangle2D showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) throws IOException {
    Rectangle2D box = null;
    PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
    List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
    StringBuilder textContent =  new StringBuilder();
    for (Object object : contents) {
        if (object instanceof TextPosition) {
            TextPosition textPosition = (TextPosition)object;
            textContent.append(textPosition.getUnicode());

            int[] codes = textPosition.getCharacterCodes();
            if (codes.length != 1) {
                System.out.printf("<!-- text position with unexpected number of codes: %d -->", codes.length);
            } else {
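                // calculateGlyphBounds may return null if no glyph path could be determined
                Shape glyphBounds = calculateGlyphBounds(textPosition.getTextMatrix(), textPosition.getFont(), codes[0]);
                if (glyphBounds != null) {
                    box = union(box, glyphBounds.getBounds2D());
                }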
            }
        } else if (object instanceof PDMarkedContent) {
            PDMarkedContent thisMarkedContent = (PDMarkedContent) object;
            box = union(box, showContent(thisMarkedContent.getMCID(), theseMarkedContents));
        } else {
            textContent.append("?" + object);
        }
    }
    return box;
}

(VisualizeMarkedContent method)

The previous two methods showStructure and showContent make use of the following helpers to build the (page-wise) union of bounding boxes:

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D>... maps) {
    Map<PDPage, Rectangle2D> result = null;
    for (Map<PDPage, Rectangle2D> map : maps) {
        if (map != null) {
            if (result != null) {
                for (Map.Entry<PDPage, Rectangle2D> entry : map.entrySet()) {
                    PDPage page = entry.getKey();
                    Rectangle2D rectangle = union(result.get(page), entry.getValue());
                    if (rectangle != null)
                        result.put(page, rectangle);
                }
            } else {
                result = map;
            }
        }
    }
    return result;
}

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D> map, PDPage page, Rectangle2D rectangle) {
    if (map == null)
        map = new HashMap<>();
    map.put(page, union(map.get(page), rectangle));
    return map;
}

Rectangle2D union(Rectangle2D... rectangles)
{
    Rectangle2D box = null;
    for (Rectangle2D rectangle : rectangles) {
        if (rectangle != null) {
            if (box != null)
                box.add(rectangle);
            else
                box = rectangle;
        }
    }
    return box;
}

(VisualizeMarkedContent helper methods)
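
One thing to be aware of in the `union(Rectangle2D...)` helper: the first non-null rectangle is reused as the accumulator, so `box.add(...)` modifies that input object. In the code here that appears harmless because the rectangles are freshly created, but if you keep the individual boxes around (for example to highlight single tags later) you may prefer a variant that copies. A minimal sketch using only `java.awt.geom`; the names `UnionSketch` and `unionCopy` are made up for illustration:

```java
import java.awt.geom.Rectangle2D;

class UnionSketch {
    /** Returns the union of all non-null rectangles without modifying any argument; null if there are none. */
    static Rectangle2D unionCopy(Rectangle2D... rectangles) {
        Rectangle2D box = null;
        for (Rectangle2D rectangle : rectangles) {
            if (rectangle == null) {
                continue;
            }
            if (box == null) {
                // Copy the first rectangle so the caller's object stays untouched.
                box = (Rectangle2D) rectangle.clone();
            } else {
                box.add(rectangle);
            }
        }
        return box;
    }

    public static void main(String[] args) {
        Rectangle2D a = new Rectangle2D.Double(0, 0, 10, 10);
        Rectangle2D b = new Rectangle2D.Double(5, 5, 10, 10);
        Rectangle2D u = unionCopy(a, null, b);
        System.out.println("union: " + u);
        System.out.println("a is unchanged: " + a);
    }
}
```

The cost is one extra allocation per union call, which should be negligible next to parsing the content streams.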

Finally the method calculateGlyphBounds has been borrowed from the PDFBox example DrawPrintTextLocations to calculate the individual glyph bounding boxes:

private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
{
    GeneralPath path = null;
    AffineTransform at = textRenderingMatrix.createAffineTransform();
    at.concatenate(font.getFontMatrix().createAffineTransform());
    if (font instanceof PDType3Font)
    {
        // It is difficult to calculate the real individual glyph bounds for type 3 fonts
        // because these are not vector fonts, the content stream could contain almost anything
        // that is found in page content streams.
        PDType3Font t3Font = (PDType3Font) font;
        PDType3CharProc charProc = t3Font.getCharProc(code);
        if (charProc != null)
        {
            BoundingBox fontBBox = t3Font.getBoundingBox();
            PDRectangle glyphBBox = charProc.getGlyphBBox();
            if (glyphBBox != null)
            {
                // PDFBOX-3850: glyph bbox could be larger than the font bbox
                glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                path = glyphBBox.toGeneralPath();
            }
        }
    }
    else if (font instanceof PDVectorFont)
    {
        PDVectorFont vectorFont = (PDVectorFont) font;
        path = vectorFont.getPath(code);

        if (font instanceof PDTrueTypeFont)
        {
            PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
            int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
            at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        }
        if (font instanceof PDType0Font)
        {
            PDType0Font t0font = (PDType0Font) font;
            if (t0font.getDescendantFont() instanceof PDCIDFontType2)
            {
                int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
        }
    }
    else if (font instanceof PDSimpleFont)
    {
        PDSimpleFont simpleFont = (PDSimpleFont) font;

        // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
        // which is why PDVectorFont is tried first.
        String name = simpleFont.getEncoding().getName(code);
        path = simpleFont.getPath(name);
    }
    else
    {
        // shouldn't happen, please open issue in JIRA
        System.out.println("Unknown font class: " + font.getClass());
    }
    if (path == null)
    {
        return null;
    }
    return at.createTransformedShape(path.getBounds2D());
}

(VisualizeMarkedContent method)
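
To see what the transform chain in `calculateGlyphBounds` does numerically, here is a small sketch with plain `java.awt.geom` and no PDFBox. All numbers are made up for illustration: a text matrix with 12 pt scaling at (100, 700), the 0.001 font matrix used by non-Type 3 fonts, a TrueType font with 2048 units per em, and a glyph outline box of 1024 x 2048 font units. The class and method names are hypothetical:

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class GlyphBoundsSketch {
    /**
     * Maps a glyph box given in font units to user space, mirroring the
     * order of operations used for TrueType fonts in calculateGlyphBounds.
     */
    static Rectangle2D glyphBoxInUserSpace(AffineTransform textMatrix, Rectangle2D glyphBox, int unitsPerEm) {
        AffineTransform at = new AffineTransform(textMatrix);
        // Font matrix: 1000 glyph space units correspond to 1 text space unit.
        at.concatenate(AffineTransform.getScaleInstance(0.001, 0.001));
        // TrueType outlines use unitsPerEm, so normalize them to the 1000 unit scale first.
        at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        Shape transformed = at.createTransformedShape(glyphBox);
        return transformed.getBounds2D();
    }

    public static void main(String[] args) {
        AffineTransform textMatrix = new AffineTransform(12, 0, 0, 12, 100, 700);
        Rectangle2D glyphBox = new Rectangle2D.Double(0, 0, 1024, 2048);
        System.out.println(glyphBoxInUserSpace(textMatrix, glyphBox, 2048));
    }
}
```

With these assumed values the glyph ends up as a box of roughly 6 x 12 user space units with its lower left corner at (100, 700). Leaving out the `unitsPerEm` scaling would make the box about twice as large, which is the kind of mismatch that extra step guards against.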

The result for your example document:

page 1

page 2

page 3

mkl
  • Kudos @mkl. Thank you so much. I will play with this code after some time. Thank you so much. I am so thankful to you. – fascinating coder Dec 14 '19 at 08:58
  • Hi @mkl. I tried the above code with this file: "https://drive.google.com/file/d/1Yf-ls58fJvrJ4yClZtFS2rSYrEDC5AJ4/view?usp=sharing". It is not detecting images, and the bounding boxes do not match. This is the code I am using: "https://drive.google.com/file/d/1al89l-WnZ-Kx8-tV0W9SZKfKYkyN9CQg/view?usp=sharing". Kindly check. – fascinating coder Dec 17 '19 at 13:29
  • *"It is not detecting Images."* - Correct. See the intro to my answer: "To determine the actual bounding boxes (in contrast to those given in some Structure Element Layout Attributes), of **the text** of some marked content"... For content other than text the `PDFMarkedContentExtractor` and the code above both have to be extended a bit. – mkl Dec 17 '19 at 17:46
  • Let me try out what changes are required to detect images as well. I want to highlight every kind of tagged content (images, vector images, links). I will get back to you. Thanks @mkl. – fascinating coder Dec 18 '19 at 06:47
  • @mkl Thanks for the answer. I tried to get image layouts (extracting BBoxes from attributes). I am able to extract them for normal images (plain XObjects) but not for vector images (meaning images, graphics and text grouped like one image). – SuperNova Dec 18 '19 at 07:23
  • @fascinatingcoder did you succeed in highlighting every type of tagged content? Would you be willing to share your code? – themenace Oct 29 '20 at 14:43