I want to tag a PDF file using Apache PDFBox. How can I do that? Is there a standard PDFStream structure that needs to be maintained in order to tag the PDF?

I tried the code below.

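// Imports for PDFBox 2.x
import java.io.File;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.*;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDMarkInfo;

// Load the document and flag it as tagged (/MarkInfo << /Marked true >>).
PDDocument doc1 = PDDocument.load(new File("filename.pdf"));
PDMarkInfo markInfo = new PDMarkInfo();
markInfo.setMarked(true);
doc1.getDocumentCatalog().setMarkInfo(markInfo);

PDDocumentCatalog cat1 = doc1.getDocumentCatalog();
PDPageTree pages = cat1.getPages();

// Parse the first page's content stream into a token list.
PDPage page1 = pages.get(0);
PDFStreamParser sParser = new PDFStreamParser(page1);
sParser.parse();
List<Object> tokens = sParser.getTokens();

// Append the tokens in content-stream order (operands before their operator).
tokens.add(Operator.getOperator("BT"));
// Tf normally takes a font resource name before the size; none is supplied here.
tokens.add(new COSFloat("11.04"));
tokens.add(Operator.getOperator("Tf"));
// Text matrix: six operands followed by Tm.
tokens.add(COSNumber.get("1"));
tokens.add(COSNumber.get("10"));
tokens.add(COSNumber.get("10"));
tokens.add(COSNumber.get("1"));
tokens.add(COSNumber.get("36"));
tokens.add(COSNumber.get("745"));
tokens.add(Operator.getOperator("Tm"));
// /P <</MCID 9000>> BDC opens a marked-content sequence tagged as a paragraph.
tokens.add(COSName.P);
COSDictionary dictToken = new COSDictionary();
dictToken.setInt(COSName.MCID, 9000);
tokens.add(dictToken);
tokens.add(Operator.getOperator("BDC"));
tokens.add(new COSString(" "));
tokens.add(Operator.getOperator("Tj"));
tokens.add(Operator.getOperator("EMC"));
tokens.add(Operator.getOperator("ET"));

// Then I updated the PDStream and saved the document.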
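
On the stream-structure part of the question: in a tagged PDF, page content is generally wrapped in marked-content sequences of the form `/Tag <</MCID n>> BDC ... EMC`, and when these are combined with text objects the `BDC`/`EMC` and `BT`/`ET` pairs have to be properly nested. The sketch below is only an illustration of that layout using plain Java strings. It does not use PDFBox, and the class and method names (`MarkedContentSketch`, `taggedTextRun`, `properlyNested`) and the font name `F1` are hypothetical. It also covers only the content-stream side; a tagged PDF additionally needs a structure tree whose elements refer to each MCID.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

class MarkedContentSketch {

    // Builds the content-stream text for one tagged text run.
    // fontName is assumed to be a font already present in the page resources.
    static String taggedTextRun(String tag, int mcid, String fontName,
                                double fontSize, int x, int y, String text) {
        // Escape the characters that are special inside a PDF literal string.
        String escaped = text.replace("\\", "\\\\")
                             .replace("(", "\\(")
                             .replace(")", "\\)");
        return String.join("\n",
            "/" + tag + " <</MCID " + mcid + ">> BDC",   // open marked content
            "BT",                                         // begin text object
            String.format(Locale.ROOT, "/%s %.2f Tf", fontName, fontSize),
            "1 0 0 1 " + x + " " + y + " Tm",             // six-operand text matrix
            "(" + escaped + ") Tj",                       // show the text
            "ET",                                         // end text object
            "EMC");                                       // close marked content
    }

    // Checks that BT/ET and BDC|BMC/EMC pairs in an operator list are properly nested.
    static boolean properlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BDC":
                case "BMC":
                    open.push(op);
                    break;
                case "ET":
                    if (open.isEmpty() || !open.pop().equals("BT")) return false;
                    break;
                case "EMC":
                    if (open.isEmpty() || open.pop().equals("BT")) return false;
                    break;
                default:
                    break; // operators that do not open or close anything
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(taggedTextRun("P", 0, "F1", 11.04, 36, 745, " "));
        System.out.println(properlyNested(
            Arrays.asList("BDC", "BT", "Tf", "Tm", "Tj", "ET", "EMC")));
    }
}
```

MCIDs only need to be unique within one content stream, so small values starting at 0 are the usual choice. In PDFBox 2.x the same sequence can be written with `PDPageContentStream.beginMarkedContent(...)` and `endMarkedContent()` instead of editing the token list by hand.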


Tilman Hausherr
  • Does this answer your question? [Tagged PDF with PDFBox](https://stackoverflow.com/questions/39872854/tagged-pdf-with-pdfbox) – Martin_0 Nov 07 '22 at 10:27
  • No, @Martin_0. The sample code does not work, and the GitHub link doesn't have any information. – Nambi Rajan.P Nov 10 '22 at 08:18

0 Answers