
Is it possible to create a tagged PDF (PDF/UA) with PDFBox? It looks like PDFBox has an API for that (package org.apache.pdfbox.pdmodel.documentinterchange.taggedpdf), but I can't find any tutorials or code examples.

Using the code below, I generated a PDF file containing an image, and the screen reader NVDA (in my case) recognizes it and reads '... graphic Alternate Description'. However, the accessibility checker PAC 2 shows an error: 'Image object not tagged'.

        PDDocument doc = new PDDocument();
        PDPage page = new PDPage();
        doc.addPage(page);
        PDDocumentCatalog documentCatalog = doc.getDocumentCatalog();

        PDImageXObject pdImage = PDImageXObject.createFromFile(imagePath, doc);
        PDPageContentStream contents = new PDPageContentStream(doc, page);
        contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
        contents.close();

        PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
        PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
        structureElement.setPage(page);

        PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, new COSDictionary());
        markedImg.addXObject(pdImage);

        structureElement.appendKid(markedImg);
        structureElement.setAlternateDescription("Alternate Description");
        treeRoot.appendKid(structureElement);
        documentCatalog.setStructureTreeRoot(treeRoot);
        // ....
        doc.save(fileName);

Can you provide some explanation and/or code examples on this subject?

Tilman Hausherr
Leonid Muzyka
  • There are no examples, sadly, mostly because none of us is involved with creating such files, AFAIK. (I am a PDFBox committer) The only thing I can do for you is to fix any bugs you may find. What you could do is to create a file with a different tool, then use PDFBox PDFDebugger to see the structure and reproduce it. – Tilman Hausherr Oct 05 '16 at 17:22
  • @TilmanHausherr , thanks for PDFDebugger. The main question now is how to write `PDStructureElement` directly in `PDPageContentStream`. – Leonid Muzyka Oct 07 '16 at 11:40
  • I assume you mean BMC, BDC, EMC, MP, DP. At this time you'd need to use the (deprecated) "raw" methods. Or you create a request in JIRA for some new methods :-) – Tilman Hausherr Oct 07 '16 at 11:46
  • PDFBox 1.8 can create PDF/A, but [only PDF/A-1b](https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html), not PDF/A-1a, which also covers PDF/UA. I haven't been able to find out if PDFBox 2.0 supports PDF/A-1a. If a PDF/A document generated with PDFBox 2 does not have accessibility tags, I would assume it is not supported yet? – Tsundoku Oct 25 '16 at 15:34
  • @leomuz, do you have acrobat? you can run the accessibility checker within acrobat to see if it has the same error as pac2. you can also look at the tag tree (view > show/hide > nav panes > tags). if you don't have acrobat, you can contact me offline and i can take a look at your file. look at my stackoverflow profile to see how to contact me. i can't help with pdfbox but perhaps seeing where the error is might help. – slugolicious Oct 25 '16 at 17:23
  • OpenHTMLtoPDF now has tagged PDF support. See the accessible PDF wiki page at: https://github.com/danfickle/openhtmltopdf/wiki/PDF-Accessibility-(PDF-UA,-WCAG,-Section-508)-Support – Daniel F Jun 26 '19 at 15:39

1 Answer


I put up a working example which demonstrates creating an accessible PDF using PDFBox 2: https://github.com/martinlovell/accessible-pdfbox-example

A few things are missing from the code in the question. The marked content needs alt text, and I believe it also needs an MCID (marked-content identifier) so that the structure element can refer to it.

The example project demonstrates in more detail what you need.

It would be something like this:

PDPageContentStream contents = new PDPageContentStream(doc, page);

// the content in the stream needs an id
int mcid = 5;
COSDictionary dictionary = new COSDictionary();
dictionary.setInt(COSName.MCID, mcid);

// wrap image drawing in marked content
contents.beginMarkedContent(COSName.IMAGE, PDPropertyList.create(dictionary));
contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
contents.endMarkedContent();

contents.close();

PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
documentCatalog.setStructureTreeRoot(treeRoot);
PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
structureElement.setPage(page);
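structureElement.setAlternateDescription("Alternate Description");
// attach the element to the tree root, as the code in the question does
treeRoot.appendKid(structureElement);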

// Set alt text on marked content for structure.  
// This is the dictionary with the mcid used in beginMarkedContent.
dictionary.setString(COSName.ALT, "Alternate Description");
PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, dictionary);
markedImg.addXObject(pdImage);
structureElement.appendKid(markedImg);
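
To make the role of the MCID more concrete, here is a rough sketch in plain Java (no PDFBox involved) of the kind of marked-content sequence that ends up in the page content stream. It uses the inline-dictionary form of the BDC operator from the PDF specification. A library such as PDFBox may instead reference the property dictionary through a named resource, so this is not meant to reproduce its exact output. The class name, the method name and the image resource name `Im0` are assumptions made up for illustration.

```java
class MarkedContentSketch {

    /**
     * Wraps content-stream operators in a BDC ... EMC marked-content
     * sequence whose property dictionary carries the given MCID.
     * (Hypothetical helper, not part of PDFBox.)
     */
    static String markedContent(String tag, int mcid, String operators) {
        return "/" + tag + " <</MCID " + mcid + ">> BDC\n"
                + operators + "\n"
                + "EMC\n";
    }

    public static void main(String[] args) {
        // "Im0" is an assumed resource name for the image XObject;
        // the matrix scales the image to 200 x 150 and places it at (100, 600).
        String drawImage = "q 200 0 0 150 100 600 cm /Im0 Do Q";

        // Tag "Image" and MCID 5 mirror the values used in the code above.
        System.out.print(markedContent("Image", 5, drawImage));
    }
}
```

Running it prints the three lines `/Image <</MCID 5>> BDC`, the image drawing operators, and `EMC`. The structure element then points at this sequence by storing the same MCID among its kids, which is what ties the drawn image to its tag and alt text.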
mlovell
  • The link to the accessible PDFBox example returns a 404; can you help us find where you moved it? Specifically, I am interested in inserting a bit of text tagged to be ignored (the accessibility standard seems to be to ignore page numbers). Thanks! – DavesPlanet Dec 05 '22 at 21:55
  • @DavesPlanet this is the archived repo page, but the source files have not been archived: https://web.archive.org/web/20201107152042/https://github.com/martinlovell/accessible-pdfbox-example – Doc Jul 13 '23 at 13:36