How to heal inconsistent parent tree mappings in a PDF created by pdfBox

Question

We are creating PDF documents in Java using PDFBox. Since they should be accessible to screen readers, we use tags, set up a parent tree, and add it to the document catalog.

Please find an example file here.

When we check the resulting PDF with the PAC3 validator, we get 25 errors for inconsistent entries in the structural parent tree.


The syntax error check in Adobe Preflight gives the same result with more detail. The error message is:

Inconsistent ParentTree mapping (ParentTree element 0) for structure element 
Traversal Path:->StructTreeRoot->K->K->[1]->K->[3]->K->[4]

Adobe preflight syntax error check

When I try to follow that traversal path in the PDFBox Debugger, I see an element referencing the ID 22.

Now my questions are:

What is the connection between the StructTreeRoot and the ParentTree?
Where in the StructTreeRoot/ParentTree can I find the item with ID 22 that is referred to in node K->K->2->K->4->K->4? See image PDF Debugger
What is the "ParentTree element 0" in the Preflight error message? See image Adobe preflight syntax error check

PDF Debugger

I think that both building accessible PDFs with PDFBox and the error messages of common validation tools are rather poorly documented. Where can I find more information about this?

Thanks a lot for your help.

@mkl please find an example file [here](https://www.dropbox.com/s/fq6m3o4rx9swq76/Testdatei.pdf?dl=0) — rsr03, Dec 18 '19 at 09:35
The issue in your PDF is very reminiscent of the one discussed last in [this answer](https://stackoverflow.com/a/57592766/1729265) to the question [“Find Tag from Selection” is not working in tagged pdf?](https://stackoverflow.com/q/57591441/1729265) by [fascinating coder](https://stackoverflow.com/users/11956879/fascinating-coder): In your parent tree you do not reference the actual parent structure element of the MCID; instead you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent and to have the MCID in question as kid. — mkl, Dec 18 '19 at 10:57
Instead you should simply reference the actual parent structure element of the MCID. — mkl, Dec 18 '19 at 11:13
@mkl Thanks for your comments. I think you're pushing us in the right direction. — rsr03, Dec 18 '19 at 13:15

mkl · Accepted Answer · 2019-12-19T13:14:20.457

The issue in your PDF is very reminiscent of the one discussed in the last section, "Yet another issue with parent tree entries", of this answer to the question “Find Tag from Selection” is not working in tagged pdf? by fascinating coder:

In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.

Instead you should simply reference the actual parent structure element of the MCID.
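
To make that consistency requirement concrete, here is a minimal, library-free sketch in plain Java (no PDFBox). The names `Elem`, `ParentTreeCheck`, `reachable` and `isConsistent` are hypothetical and invented for illustration only. The sketch models a single page whose parent tree entry is an array indexed by MCID, and treats an entry as consistent only if it points at a structure element that is reachable from the structure tree root and lists that MCID among its own kids.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a structure element (not a PDFBox class).
class Elem {
    final String name;
    final Elem parent;                           // models the /P entry
    final List<Object> kids = new ArrayList<>(); // models /K: Elem or Integer (MCID)

    Elem(String name, Elem parent) {
        this.name = name;
        this.parent = parent;
    }
}

class ParentTreeCheck {
    // True if 'target' is 'node' itself or a descendant reached through the kids lists.
    static boolean reachable(Elem node, Elem target) {
        if (node == target) return true;
        for (Object kid : node.kids) {
            if (kid instanceof Elem && reachable((Elem) kid, target)) return true;
        }
        return false;
    }

    // parentTreeEntry models the array stored for one page: index = MCID.
    static boolean isConsistent(Elem root, List<Elem> parentTreeEntry) {
        for (int mcid = 0; mcid < parentTreeEntry.size(); mcid++) {
            Elem claimed = parentTreeEntry.get(mcid);
            if (claimed == null) continue; // MCID not used as a content item
            // The referenced element must really have this MCID as a kid ...
            if (!claimed.kids.contains(Integer.valueOf(mcid))) return false;
            // ... and must be part of the structure tree hanging off the root.
            if (!reachable(root, claimed)) return false;
        }
        return true;
    }
}
```

In this model, an entry pointing at a paragraph element that has MCID 0 among its kids passes the check. An entry pointing at a freshly created node that names the paragraph as its parent and claims MCID 0 as kid, but was never added to the paragraph's kids, fails it, which mirrors the situation described above.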

As your question title asks how to heal inconsistent parent tree mappings in a PDF created by PDFBox, here is an approach that fixes your parent tree by rebuilding it from the structure tree.

First recursively collect MCIDs and their parent structure tree elements by page, e.g. using a method like this:

// Imports used by the methods in this answer (PDFBox 2.x packages)
import java.util.HashMap;
import java.util.Map;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDNumberTreeNode;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDParentTreeValue;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureNode;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

void collect(PDPage page, PDStructureNode node, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
    COSDictionary pageDictionary = node.getCOSObject().getCOSDictionary(COSName.PG);
    if (pageDictionary != null) {
        page = new PDPage(pageDictionary);
    }

    for (Object object : node.getKids()) {
        if (object instanceof COSArray) {
            for (COSBase base : (COSArray) object) {
                if (base instanceof COSDictionary) {
                    collect(page, PDStructureNode.create((COSDictionary) base), parentsByPage);
                } else if (base instanceof COSNumber) {
                    setParent(page, node, ((COSNumber)base).intValue(), parentsByPage);
                } else {
                    System.out.printf("?%s\n", base);
                }
            }
        } else if (object instanceof PDStructureNode) {
            collect(page, (PDStructureNode) object, parentsByPage);
        } else if (object instanceof Integer) {
            setParent(page, node, (Integer)object, parentsByPage);
        } else {
            System.out.printf("?%s\n", object);
        }
    }
}

(RebuildParentTreeFromStructure method)

with this helper method

void setParent(PDPage page, PDStructureNode node, int mcid, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
    if (node == null) {
        System.err.printf("Cannot set null as parent of MCID %s.\n", mcid);
    } else if (page == null) {
        System.err.printf("Cannot set parent of MCID %s for null page.\n", mcid);
    } else {
        Map<Integer, PDStructureNode> parents = parentsByPage.get(page);
        if (parents == null) {
            parents = new HashMap<>();
            parentsByPage.put(page, parents);
        }
        if (parents.containsKey(mcid)) {
            System.err.printf("MCID %s already has a parent. New parent rejected.\n", mcid);
        } else {
            parents.put(mcid, node);
        }
    }
}

(RebuildParentTreeFromStructure helper method)

and then rebuild based on the collected information:

void rebuildParentTreeFromData(PDStructureTreeRoot root, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
    int parentTreeMaxkey = -1;
    Map<Integer, COSArray> numbers = new HashMap<>();

    for (Map.Entry<PDPage, Map<Integer, PDStructureNode>> entry : parentsByPage.entrySet()) {
        int parentsId = entry.getKey().getCOSObject().getInt(COSName.STRUCT_PARENTS);
        if (parentsId < 0) {
            System.err.printf("Page without StructsParents. Ignoring %s MCIDs.\n", entry.getValue().size());
        } else {
            if (parentTreeMaxkey < parentsId)
                parentTreeMaxkey = parentsId;
            COSArray array = new COSArray();
            for (Map.Entry<Integer, PDStructureNode> subEntry : entry.getValue().entrySet()) {
                array.growToSize(subEntry.getKey() + 1);
                array.set(subEntry.getKey(), subEntry.getValue());
            }
            numbers.put(parentsId, array);
        }
    }

    PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(PDParentTreeValue.class);
    numberTreeNode.setNumbers(numbers);
    root.setParentTree(numberTreeNode);
    root.setParentTreeNextKey(parentTreeMaxkey + 1);
}

(RebuildParentTreeFromStructure method)

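Applied like this (rebuildParentTree is a wrapper not shown here; presumably it calls collect for the structure tree root and then rebuildParentTreeFromData):

try (PDDocument document = PDDocument.load(SOURCE)) {
    rebuildParentTree(document);
    document.save(RESULT);
}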

(RebuildParentTreeFromStructure test testTestdatei)

PAC3 and Adobe Preflight (at least the one in my old Acrobat 9.5) go all green for the result.

Beware: This is no generic parent tree rebuilder yet. It is made to work for the test file at hand with a specific kind of structure tree nodes and content only in page content streams. For a generic tool it has to learn to cope with other kinds, too, and to also process e.g. marked content in embedded XObjects.
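
The layout this rebuild aims for can also be illustrated without PDFBox. The following is only a sketch under simplifying assumptions: `Node`, `ParentTreeModel`, `collect` and `rebuild` here are hypothetical plain-Java stand-ins, each node carries the StructParents key of its page directly (in a real PDF that key lives in the page dictionary referenced by /Pg), and parents are represented by their names. The outer map key corresponds to StructParents, and the index in each list corresponds to the MCID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a structure element; not a PDFBox type.
class Node {
    final String name;
    final Integer structParents;                 // page key if this node sets a page, else null
    final List<Object> kids = new ArrayList<>(); // Node or Integer (MCID)

    Node(String name, Integer structParents) {
        this.name = name;
        this.structParents = structParents;
    }
}

class ParentTreeModel {
    // Walks the tree and records, per page key, a list where index = MCID and value = parent name.
    static void collect(Integer pageKey, Node node, Map<Integer, List<String>> nums) {
        if (node.structParents != null) {
            pageKey = node.structParents; // otherwise the page is inherited from the ancestor
        }
        for (Object kid : node.kids) {
            if (kid instanceof Node) {
                collect(pageKey, (Node) kid, nums);
            } else if (kid instanceof Integer && pageKey != null) {
                int mcid = (Integer) kid;
                List<String> array = nums.computeIfAbsent(pageKey, k -> new ArrayList<>());
                while (array.size() <= mcid) {
                    array.add(null); // grow the list so that index 'mcid' exists
                }
                if (array.get(mcid) == null) {
                    array.set(mcid, node.name); // first parent wins, later ones are ignored
                }
            }
        }
    }

    static Map<Integer, List<String>> rebuild(Node root) {
        Map<Integer, List<String>> nums = new TreeMap<>();
        collect(null, root, nums);
        return nums;
    }
}
```

The point of the model is that the result is flat: one list per page, and each slot names the structure element that directly contains the MCID, with no intermediate nodes.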

Hi @mkl, thanks for this great answer. Just when you posted it, we came up with our own solution that generates a valid ParentTree, so there is no need to rebuild it any more. But again, your answers pointed in the right direction and helped us a great deal. We didn't test the tree rebuilder ourselves, but I will mark your answer as the solution. — rsr03, Dec 19 '19 at 07:52
If you don't mind, I will post another answer later with the changes we made and the resulting PDF. We found this solution more or less by comparing our ParentTree with the ParentTree of a valid document (the PAC3 report doc, by the way). But we still don't really understand which part of the documentation we missed that makes the difference between a valid and an invalid ParentTree. — rsr03, Dec 19 '19 at 08:05
@rsr03 *"If you don't mind, I will post another answer later with the changes we made and the resulting PDF."* - Yes, of course, do so! The solution of your problem must be a fix of your existing code, not a post-processing step fixing broken output by prior steps. The code in my answer could merely serve as a work-around until the final fix could be deployed. — mkl, Dec 19 '19 at 10:47

score 1 · Answer 2 · edited Dec 19 '19 at 13:19

Thanks to the comments of @mkl we have analyzed our solution over and over again. In our first approach we followed the example of this post from @GurpusMaximus and his GitHub repo. Thanks also to @GurpusMaximus for a complete example code! But obviously we did not find the right strategy for creating the parent tree in the PDFormBuilder.addContentToParent(...) method for our data. There in line 206 for each MarkedContent element a new COSDictionary is added. This led us to create a deeply branched structure tree where there is also a structuring within the parent tree.

In a final step, we added numDictionaries to the ParentTree as suggested in step 3 of this post.

This resulted in the odd parent tree seen in our first example file.

The comparison with the parent tree of a valid PDF (the PAC3 report pdf) has shown that there is only a flat tree structure which only holds a reference to the parent structure element or parent tree element for each MarkedContent element.

We changed addContentToParent to the following form:

public PDStructureElement addContentToParent(COSName name, String type,
        PDStructureElement parent) {

    PDStructureElement parentElem = parent;
    if (parentElem == null) {
        parentElem = currentElem;
    }

    PDStructureElement structureElement = null;
    if (type != null) {
        structureElement = new PDStructureElement(type, parentElem);
        structureElement.setPage(qrbill.getPage(0));
    }

    if (name != null) {
        if (structureElement != null) {
            if (!COSName.ARTIFACT.equals(name)) {
                structureElement.appendKid(new PDMarkedContent(name,
                        currentMarkedContentDictionary));
            } else {
                structureElement.appendKid(new PDArtifactMarkedContent(
                        currentMarkedContentDictionary));
            }
            numDictionaries.add(structureElement.getCOSObject());
        } else {
            if (!COSName.ARTIFACT.equals(name)) {
                parentElem.appendKid(new PDMarkedContent(name,
                        currentMarkedContentDictionary));
            } else {
                parentElem.appendKid(new PDArtifactMarkedContent(
                        currentMarkedContentDictionary));
            }
            numDictionaries.add(parentElem.getCOSObject());
        }
        currentStructParent++;
    }

    if (structureElement != null) {
        parentElem.appendKid(structureElement);
        if (name == null && !type.matches("H[1-9]?")) {
            currentElem = structureElement;
        }
    }

    return structureElement;
}

As you can see, we now add an element to numDictionaries only when marked content is placed directly inside a new structure element or inside the parent element. This gives us a flat hierarchy without unnecessary in-between elements, as suggested by @mkl in the accepted answer.

After we did that, we had no errors in the PAC3 check any more. The preflight check still complained about a wrong array size which we healed by changing the addParentTree method like this:

public void addParentTree() {
    final COSDictionary dict = new COSDictionary();
    nums.add(numDictionaries);
    dict.setItem(COSName.NUMS, nums);

    final PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(dict,
            dict.getClass());
    qrbill.getDocumentCatalog().getStructureTreeRoot()
            .setParentTreeNextKey(currentStructParent);
    qrbill.getDocumentCatalog().getStructureTreeRoot()
            .setParentTree(numberTreeNode);
    qrbill.getDocumentCatalog().getStructureTreeRoot().appendKid(rootElem);
}

Now, our example file changed to something like this.

We have been reading chapter 14.7.4.4 in the PDF reference over and over again, but we still can't find the point where we missed something.

The parent tree is a number tree (see 7.9.7, “Number Trees”), accessed from the ParentTree entry in a document’s structure tree root (Table 322). The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item. The key for each entry shall be an integer given as the value of the StructParent or StructParents entry in the object (see Table 326).

Maybe it's just my bad English, but I can't see why deeply structured parent trees are bad.

Thanks again for your help @mkl and for the example implementation @GurpusMaximus!!

*"we still can't find the point where missed something."* - see Table 322 – Entries in the structure tree root –, the entry for **ParentTree**: *For a page object or content stream containing marked-content sequences that are content items, the value shall be an array of references to the parent elements of those marked-content sequences.* Here "parent elements" means parent elements in the structure tree, not some separate parents not present in the structure tree. Also read the first paragraph of 14.7.4.4, the purpose of the parent tree is a way to get from MCID to structure tree parent. — mkl, Dec 19 '19 at 13:38
