I want to tag a PDF file using Apache PDFBox. How can I do that? Is there a standard content stream structure that must be maintained for the PDF to be tagged?
I tried the code below.
PDDocument doc1 = PDDocument.load(new File("filename.pdf"));
PDMarkInfo markInfo = new PDMarkInfo();
markInfo.setMarked(true);
doc1.getDocumentCatalog().setMarkInfo(markInfo);
PDDocumentCatalog cat1 = doc1.getDocumentCatalog();
PDPageTree pages = cat1.getPages();
PDPage page1 = pages.get(0);
PDFStreamParser sParser = new PDFStreamParser(page1);
sParser.parse();
List<Object> tokens = sParser.getTokens();
// Append the new tokens in content stream order: operands first, then their operator.
// (Calling tokens.add(index, x) with the same fixed index each time would insert
// every token in front of the previous one and reverse the sequence.)
tokens.add(Operator.getOperator("BT"));
// Note: Tf normally expects a font resource name operand before the size.
tokens.add(new COSFloat("11.04"));
tokens.add(Operator.getOperator("Tf"));
// Text matrix operands, followed by Tm
tokens.add(COSNumber.get("1"));
tokens.add(COSNumber.get("10"));
tokens.add(COSNumber.get("10"));
tokens.add(COSNumber.get("1"));
tokens.add(COSNumber.get("36"));
tokens.add(COSNumber.get("745"));
tokens.add(Operator.getOperator("Tm"));
// Marked-content sequence: /P <</MCID 9000>> BDC ... EMC
tokens.add(COSName.P);
COSDictionary dictToken = new COSDictionary();
dictToken.setInt(COSName.MCID, 9000);
tokens.add(dictToken);
tokens.add(Operator.getOperator("BDC"));
tokens.add(new COSString(" "));
tokens.add(Operator.getOperator("Tj"));
tokens.add(Operator.getOperator("EMC"));
tokens.add(Operator.getOperator("ET"));
// After this, I update the PDStream with the modified tokens and save the document.
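
A side note on ordering: the tokens are written to the content stream in list order, so how they are added to the list matters. The following is a minimal sketch using only the Java standard library, with plain strings standing in for the PDFBox token objects. The class name `TokenOrderSketch` and both helper methods are hypothetical and exist only for illustration. It compares `List.add(int, E)` at a fixed index with a plain `add(E)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; strings stand in for PDFBox tokens.
class TokenOrderSketch {

    // Inserts every token at the same fixed index, so each new token
    // lands in front of the ones inserted before it.
    static List<String> insertAtFixedIndex(String... tokens) {
        List<String> out = new ArrayList<>();
        int size = out.size(); // captured once, never updated
        for (String t : tokens) {
            out.add(size, t);
        }
        return out;
    }

    // Appends every token to the end, preserving the given order.
    static List<String> append(String... tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] ops = {"BT", "/P", "<</MCID 0>>", "BDC", "( )", "Tj", "EMC", "ET"};
        // prints: ET EMC Tj ( ) BDC <</MCID 0>> /P BT
        System.out.println(String.join(" ", insertAtFixedIndex(ops)));
        // prints: BT /P <</MCID 0>> BDC ( ) Tj EMC ET
        System.out.println(String.join(" ", append(ops)));
    }
}
```

Only the appended sequence keeps `BDC`/`EMC` and `BT`/`ET` properly paired and nested, which a content stream needs, as far as I understand.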