"Find Tag from Selection" is not working in tagged pdf?

Question

I have tagged a PDF using PDFBox.

How I tagged it: instead of extracting the text and tagging it, I add MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P << /MCID 0 >> BDC ... EMC) and then add that marked content to the structure tree in the document catalog.

What is working: almost everything works as in a completely tagged PDF. The file also passes the PAC3 accessibility checker.

//Adding tags
tokens.add(++ind, type_check(t_ype, page));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++;
tokens.add(++ind, currentMarkedContentDictionary);
tokens.add(++ind, Operator.getOperator("BDC"));

// Adding marked content to root structure
structureElement.appendKid(markedContent);

currentSection.appendKid(structureElement);

What is not working: after tagging, one feature is missing from the tag structure. There is an option called "Find Tag from Selection". It is not working: when I select some text and press "Find Tag from Selection", it goes to the last tag in the root structure. Please find the PDF at the link below.

https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing

Parent tree:

https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing

extra doc with tagging and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing

Please help me to solve this problem.

New problem: I observed the following.

While JAWS is reading my tagged document, I press Ctrl+Shift+5 on a Windows machine (in Adobe Acrobat DC). This opens a dialog with a drop-down offering "Read based on tagged structure" or "Top left to bottom right", and below it two radio buttons: "Read current page" and "Read all pages".

I selected "Read based on tagged structure" and "Read current page". Now JAWS does not read the tag structure. But if I use the same document with "Read entire document", it reads perfectly.

Link to doc:

https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing

Any help?

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

A nesting issue

How I tagged it: instead of extracting the text and tagging it, I add MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P << /MCID 0 >> BDC ... EMC)

You're doing this incorrectly. See for example the start of the page content stream in your document:

BT
0 i
/C0_0 18 Tf
41.91 740.175 Td
/H2 <</MCID  0  >> BDC
( \) F M M P  8 P S M E) Tj
ET
/TouchUp_TextEdit MP
BT
/C0_1 14 Tf
EMC

Focusing on the beginning and end of text objects and marked content, we see that you have BT ... BDC ... ET ... BT ... EMC

According to the specification, though:

When the marked-content operators BMC, BDC, and EMC are combined with the text object operators BT and ET (see 9.4, “Text Objects”), each pair of matching operators (BMC…EMC, BDC…EMC, or BT…ET) shall be properly (separately) nested. Therefore, the sequences
BMC             BT
  BT              BMC
    …    and         …
  ET              EMC
EMC             ET
are valid, but
BMC             BT
  BT              BMC
    …    and         …
  EMC             ET
BT              EMC
are not valid.

(ISO 32000-1 section 14.6 "Marked Content")

This issue was fixed in the second shared PDF, res1.pdf.
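
The nesting rule quoted above can be checked mechanically with a stack. The following is only a minimal sketch in plain Java: it assumes the content stream has already been reduced to a list of operator names (operands dropped), and the class and method names are made up for illustration; they are not part of PDFBox.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: checks that BT...ET and BMC/BDC...EMC pairs are properly nested. */
final class NestingCheck {

    /**
     * @param operators operator names of one content stream, in order
     *                  (assumption: operands were already stripped)
     * @return true if every ET closes a BT and every EMC closes a BMC or BDC
     */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET": {
                    // ET must close the most recently opened pair, and that pair must be BT
                    if (open.isEmpty() || !open.pop().equals("BT")) {
                        return false;
                    }
                    break;
                }
                case "EMC": {
                    // EMC must close the most recently opened pair, and that pair must be BMC or BDC
                    if (open.isEmpty()) {
                        return false;
                    }
                    String top = open.pop();
                    if (!top.equals("BMC") && !top.equals("BDC")) {
                        return false;
                    }
                    break;
                }
                default:
                    break; // all other operators do not affect nesting
            }
        }
        return open.isEmpty(); // nothing may be left open at the end
    }
}
```

Fed with the operator sequence from the stream excerpt above (BT, i, Tf, Td, BDC, Tj, ET, MP, BT, Tf, EMC), this check fails at the first ET, because the innermost open pair at that point is the BDC.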

Missing ParentTree and StructParents

The problem your question focuses on is

There is an option called "Find Tag from Selection". It is not working.

Finding a tag from selection essentially means that you have the MCID of some content stream instruction and you search the structure element in the structure tree referencing that marked content ID.

How PDF processors are expected to do this is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2):

Because a stream cannot contain object references, there is no way for content items that are marked-content sequences to refer directly back to their parent structure elements (the ones to which they belong as content items). Instead, a different mechanism, the structural parent tree, shall be provided for this purpose. For consistency, content items that are entire PDF objects, such as XObjects, shall also use the parent tree to refer to their parent structure elements.

The parent tree is a number tree, accessed from the ParentTree entry in a document’s structure tree root. The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item.

Your PDF does not have that ParentTree at all, and your page does not contain a StructParents entry to look up in a parent tree. Thus, the prescribed way to get from marked content to the structure tree cannot be followed.
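
To make the lookup concrete, here is a rough model of what a viewer does for "Find Tag from Selection", written in plain Java. It is a sketch under simplifying assumptions, not PDFBox code: the parent tree is modelled as a map from a page's StructParents value to a list indexed by MCID, structure elements are represented by plain strings, and all names are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the lookup: StructParents -> array -> element at index MCID. */
final class ParentTreeLookup {

    // key: StructParents value of a page; value: structure elements indexed by MCID
    private final Map<Integer, List<String>> parentTree = new HashMap<>();

    void put(int structParents, List<String> elementsByMcid) {
        parentTree.put(structParents, elementsByMcid);
    }

    /**
     * @param structParents the page's StructParents entry, or null if the page has none
     * @param mcid          the MCID of the selected marked-content sequence
     * @return the structure element, or null if the lookup cannot be completed
     */
    String findStructureElement(Integer structParents, int mcid) {
        if (structParents == null) {
            return null; // page has no StructParents entry: nothing to look up
        }
        List<String> byMcid = parentTree.get(structParents);
        if (byMcid == null || mcid < 0 || mcid >= byMcid.size()) {
            return null; // no parent tree entry for this page or this MCID
        }
        return byMcid.get(mcid);
    }
}
```

In this model, a document without a ParentTree corresponds to an empty map and a page without StructParents to a null key, so every lookup returns null. That is the situation in the first shared PDF.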

A ParentTree was added in the third shared PDF, new.pdf.

Incorrect ParentTree entries

While in new.pdf you have a ParentTree, its contents are clearly incorrect:

The ParentTree is a number tree, i.e. integers are mapped to something here, so there obviously must not be multiple entries for the same integer key.
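
A quick sanity check for this can be sketched as follows. It assumes the integer keys of the Nums array have already been extracted into a list, and it relies on my reading of the number tree definition, namely that keys are expected in ascending order, so a repeated key shows up as a non-increasing step. The class and method names are illustrative only.

```java
import java.util.List;

/** Hypothetical check for the keys of a number tree Nums array [key1 value1 key2 value2 ...]. */
final class NumsKeyCheck {

    /** Returns true if the keys are strictly ascending, i.e. sorted and free of duplicates. */
    static boolean hasStrictlyAscendingKeys(List<Integer> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i) <= keys.get(i - 1)) {
                return false; // duplicate or out-of-order key
            }
        }
        return true;
    }
}
```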

Furthermore, looking inside one of those values:

one sees that you claim that the following StructElem is the value for all marked content IDs:

Inspecting this StructElem further, one sees that it represents the final paragraph on the final page.

Thus, your observation

Now instead of "selection not found" it is highlighting the last <P> tag in parent tree, irrespective of what we selected.

is what one can expect. If one expects any reasonable behavior at all, that is, with a ParentTree structure broken so badly.

Actually there was not only this new.pdf but also res.pdf and tagged without altext.pdf with ParentTrees, but all these ParentTrees were broken like the tree of new.pdf.

You might want to start inspecting the structures you create when analyzing an unwanted behavior.

Another issue with parent tree entries

The previously described issue in parent trees has meanwhile been resolved: different pages now have different struct parents, and the parent tree arrays now reference the struct elements for distinct MCIDs.

For some documents a different error occurs now, though, e.g. "res29_08_19.pdf". Here the parent tree starts like this:

In particular the first entry in the array is for MCID 3, the second for MCID 4, ...

This is invalid, according to the specification

The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.

(ISO 32000-1 section 14.7.4.4 "Finding Structure Elements from Content Items")

Thus, the first entry must be for MCID 0, the second for MCID 1, ...

You objected in a comment

No I used 0 and 1 Mcid's for Artifacts.

But as a corollary of the above: Do not give MCIDs to marked content sequences you don't have a structure element for! MCIDs are for going back and forth between the structure hierarchy and the content streams. If you mark a piece of content without having a structure element for it, don't give it a MCID.
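
One way to enforce this while building the array for a page is sketched below. It is plain Java rather than PDFBox code. It assumes you have collected, per page, a map from MCID to the structure element that owns it (represented as strings here), and it follows the reading given above that MCIDs must run 0, 1, 2, ... without gaps. Artifacts simply never get an MCID and therefore never appear in the map. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical builder for one page's parent tree array, indexed by MCID. */
final class ParentTreeArrayBuilder {

    /**
     * @param elementByMcid MCID -> owning structure element for one page
     * @return list whose index i holds the structure element for MCID i
     * @throws IllegalArgumentException if an MCID is skipped (e.g. MCIDs start at 2)
     */
    static List<String> build(Map<Integer, String> elementByMcid) {
        List<String> array = new ArrayList<>();
        // iterate in ascending MCID order so that list position == MCID
        for (Map.Entry<Integer, String> entry : new TreeMap<>(elementByMcid).entrySet()) {
            if (entry.getKey() != array.size()) {
                throw new IllegalArgumentException(
                        "MCID " + array.size() + " has no structure element");
            }
            array.add(entry.getValue());
        }
        return array;
    }
}
```

With MCIDs 0 and 1 spent on artifacts, the map would start at key 2 and the builder would reject it, which mirrors the problem in "res29_08_19.pdf".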

Yet another issue with parent tree entries

You again report problems with your newest file mathpdf.pdf. And indeed, there are issues; Adobe Acrobat Preflight reports a five-page list of inconsistent parent tree mappings like this:

In contrast to the previous issues the cause does not become clear by looking at the parent tree alone, one also has to look at the structure hierarchy.

Doing so, though, one peculiarity immediately hits the eye: In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.

For example let's look at the MCID 0 on the first page. In the structure hierarchy you have:

In the parent tree you have:

You should have simply referenced object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array for page one instead of that in-between object 62 which claims to have that object 238 as parent and MCID 0 as kid.

The reported inconsistency may be due to the fact that the node referenced from the parent tree (object 62) claims to be a P paragraph whose parent node (object 238) is a Span. That is not allowed: a paragraph may contain a span, but it cannot be contained in one.
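
The consistency a checker is presumably looking for can be expressed as two conditions: the element referenced from the parent tree must list the MCID among its kids, and that element must itself be a kid of the node it names as parent. Below is a minimal sketch with a toy structure element class in plain Java. The class names and fields are assumptions for illustration and do not correspond to PDFBox types.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy structure element: a type, a parent, and kids (Integer MCIDs or child elements). */
final class StructElem {
    final String type;
    final StructElem parent;
    final List<Object> kids = new ArrayList<>();

    StructElem(String type, StructElem parent) {
        this.type = type;
        this.parent = parent;
    }
}

/** Hypothetical check of a single parent tree mapping MCID -> referenced element. */
final class ParentTreeConsistency {

    static boolean isConsistent(StructElem referenced, int mcid) {
        // the referenced element must really own the marked content
        if (!referenced.kids.contains(Integer.valueOf(mcid))) {
            return false;
        }
        // and it must really hang in the hierarchy below its claimed parent
        return referenced.parent == null || referenced.parent.kids.contains(referenced);
    }
}
```

An in-between node like object 62 fails the second condition: it names object 238 as its parent, but object 238 does not list it as a kid.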

Ok, I adjusted my code and tried again. same result. Please see the screen shot. https://drive.google.com/file/d/1587Ih0cDDXPuxfA97EM3EZ4O-p7PKIMq/view?usp=sharing — fascinating coder, Aug 21 '19 at 14:18
@fascinatingcode Ok, I just realized your structure information do not contain a parent tree at all. That parent tree is required for going from content to structure, see my edited answer. — mkl, Aug 22 '19 at 13:04
Apologies, I forgot to add the parent tree to the PDF. Now I added a new document link to the question which also contains a parent tree (please confirm whether the two PDFs contain a parent tree or not). Now instead of "selection not found" it is highlighting the last <P> tag in parent tree, irrespective of what we selected. (Try selecting 'hello world' and pressing "Find Tag from Selection" in the first added PDF.) — fascinating coder, Aug 24 '19 at 10:55
Your **ParentTree** entries are incorrect, see my edited answer. — mkl, Aug 26 '19 at 16:27
Could you please tell me what are the tools available for to inspect pdf parent tree and structparents. Please provide some links or help me to code(PDFBox) to form a parent tree structure without errors. — fascinating coder, Aug 27 '19 at 12:32
I don't know specific tools for only the structure elements. I used a generic tool to inspect PDF objects, iText RUPS; similar is PDFBox PDFDebugger; and there is a similar function in Adobe Preflight. — mkl, Aug 27 '19 at 14:11
Thanks a lot both bugs are fixed. You can see my parent tree now. https://drive.google.com/file/d/1_iaoPx0sEkaNjPDEUYuRLeApmOjYr_0o/view?usp=sharing — fascinating coder, Aug 28 '19 at 04:29
I tested my code by remediating different documents. Some documents are working fine but some are not why. below file link passing all checks and working very well while jaws reading. https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing Below link still problems find tag selection and jaws read only current page not reading tag structure. https://drive.google.com/file/d/1KTwtyd4J1_hTx4o3xGoA-MR9GpqL8JqS/view?usp=sharing — fascinating coder, Aug 29 '19 at 10:10
@fascinatingcoder In "res29_08_19.pdf" you forgot the entries for the MCIDs 0 and 1 and others on both pages in the **ParentTree**. — mkl, Aug 30 '19 at 09:19
No I used 0 and 1 Mcid's for Artifacts. So H1 is started with 2. the page no and beside text I tagged like Artifacts. see below link. https://drive.google.com/file/d/1CDKy9M3r0ryDAp5tUpARlDOEkmQIC4w9/view?usp=sharing — fascinating coder, Aug 30 '19 at 10:00
See the specification, in these parent tree arrays, "The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array." Thus, the structure element for MCID 0 is referenced at position 0, the structure element for MCID 1 is referenced at position 1, .... Thus, you cannot leave them out. If you don't want structure elements for some marked content sequences, don't give MCIDs to those sequences. — mkl, Aug 30 '19 at 11:01
Hi @mkl, This time I tagged inline spans to
tag. Now the read single page option is not working? I don't know why. Can you please help. Parent tree is fine I think. https://drive.google.com/file/d/1aD1HGQsEXOovpfWdf7JRNwJhP7tX6pmy/view?usp=sharing — fascinating coder, Sep 07 '19 at 14:48
At the same above pdf in first page jaws read all pages mode if read paragraph by paragraph(by just pressing P or ctrl+down arrow) after the image immediate
tag is not reading and directly going to H3. But if you read top to bottom it is reading fine. — fascinating coder, Sep 09 '19 at 05:57
@fascinatingcoder I've looked a bit deeper into how your parent tree and your structure hierarchy relate, and only now I became aware that you introduced an unnecessary layer of in-between structure nodes probably confusing tagging-aware software. See the edit of my answer. — mkl, Sep 10 '19 at 09:22