Can we create a new custom PDFOperator (like PDFOperator{BDC}) and COSBase objects (like COSName{P} and COSName{Prop1}, where Prop1 in turn references another object), and add them to the root structure of a PDF?

I have read a list of parser tokens from an existing PDF document, and I want to tag the PDF. To do that, I will first add newly created COSBase objects to the token list, and finally add them to the structure tree root. How can I create these COSBase objects? This is the code I use to extract the tokens from the PDF:

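import java.io.File;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;

// PDFBox 2.0.x API; the token list is rewritten page by page in the same document.
PDDocument document = PDDocument.load(new File(inputPdfFile));
for (PDPage page : document.getPages()) {
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    List<Object> tokens = parser.getTokens();
    List<Object> newTokens = new ArrayList<>();
    for (Object token : tokens) {
        System.out.println(token);
        if (token instanceof Operator) {
            Operator op = (Operator) token;
            // The tagging tokens would be added around this operator.
        }
        newTokens.add(token);
    }

    // Write the (possibly modified) token list back as the page's content stream.
    PDStream newContents = new PDStream(document);
    try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
        ContentStreamWriter writer = new ContentStreamWriter(out);
        writer.writeTokens(newTokens);
    }
    page.setContents(newContents);
}
document.save(outputPdfFile);
document.close();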

The code above creates a new PDF with all formatting and images intact. The newTokens list contains all the existing COSBase objects, so I want to add some tagging COSBase objects to it. Saving the new document should then give a tagged PDF, without my having to deal with decoding, encoding, fonts, or image handling.

First, will this idea work? If so, please help me with some code to create custom COSBase objects. I am very new to Java.

  • I think the best strategy would be to take an existing minimal tagged PDF, analyze it with PDFDebugger, and try to reproduce that with PDFBox. I don't think we need a new class for this. – Tilman Hausherr May 21 '19 at 07:38
  • Maybe this helps? https://stackoverflow.com/questions/39872854/tagged-pdf-with-pdfbox – Tilman Hausherr May 21 '19 at 09:46
  • Or this one: https://stackoverflow.com/questions/49682339/how-can-i-create-an-accessible-pdf-with-java-pdfbox-2-0-8-library-that-is-also-v?rq=1 – Tilman Hausherr May 21 '19 at 09:54
  • Hi @TilmanHausherr, I just want to tag an already existing untagged PDF document. If I use PDFDebugger and recreate the PDF with "currentContentStream", I have to handle all the operators (Tj, TJ, TD, Td, Tm, Tw, Tc, cm, Do, p, r, w......; their structure changes from one PDF to another) to get the same PDF style. How can I tag an existing PDF effectively? And how can I handle the different font types (Type0, Type1...)? Some text is completely encoded. – Ravikumar gogineni May 22 '19 at 07:05
  • The PDF below uses a Type0 font, so each Tj I read is encoded. ContentStream.showtext("XMFJAJKFIANRKGIADFKEWIAFJA") won't work here, so how can I tag this document? https://drive.google.com/file/d/1VCXY9OlbZENE08Ztjcnv2pVUM1cRTxx7/view?usp=sharing – Ravikumar gogineni May 22 '19 at 07:29
  • I can't look at your file right now. PDF is not really made for editing. You can still do it, you'd have to insert BMC / BDC / EMC / MP / DP at the correct positions and then save this token list (see the `RemoveAllText.java` example). And of course also build the structure tree. – Tilman Hausherr May 22 '19 at 07:38
  • To see the needed content, extract the tokens of a file with tags (e.g. IRS form 1040). – Tilman Hausherr May 22 '19 at 08:56
  • @TilmanHausherr, yes, correct, I know where to place BMC, MCID, EMC and /p or /H1. But how can I create new object-level COSBase objects and write them to the content stream? I am able to add the BDC PDFOperator, but what about /p and /prop0? Ex: /p /prop0 /BDC 9 0 obj << /Prop2 10 0 R >> 10 0 obj << /MCID 0 >> – Ravikumar gogineni May 22 '19 at 09:33
  • /p is a COSName. Use the constants (if available), or generate with `COSName.getPDFName()`. BDC is an operator. The stuff with "<<" is a COSDictionary. – Tilman Hausherr May 22 '19 at 09:51
  • @TilmanHausherr, yes, this is awesome, it worked. I am able to add COSName objects to newTokens, and all the tokens required for tagging are now in the content stream. How can I add the MCID-related P to the structure tree? Any help? – Ravikumar gogineni May 22 '19 at 11:27
  • This is the content stream now for a simple hello world PDF file. BT /p <>BDC /F1 24 Tf 175 720 Td (Hello World!)Tj EMC ET – Ravikumar gogineni May 22 '19 at 11:37
  • For the structure tree you should really look at the two links I mentioned yesterday. When done, compare your result PDF with a model file in PDFDebugger to see what you missed. Important: COSString ("()" symbol) is not the same as COSName ("/" symbol). I also think your actual question (which did not mention the structure tree) has been answered thanks to the hints; I suggest you answer it yourself and include some actual code. – Tilman Hausherr May 22 '19 at 11:43
  • Got solution thanks @TilmanHausherr – Ravikumar gogineni Jun 06 '19 at 13:58

1 Answer

Based on your document format you can insert marked content.

// The code below adds "/P <</MCID 0>> BDC" to the token list.
// BDC is an operator, so it is written without a leading slash.
newTokens.add(COSName.getPDFName("P"));
COSDictionary currentMarkedContentDictionary = new COSDictionary();
// mcid is an int counter declared earlier, starting at 0 for each page.
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
mcid++;
newTokens.add(currentMarkedContentDictionary);
newTokens.add(Operator.getOperator("BDC"));

// After opening the marked content, append your existing tokens (TJ, TD, Td, T*, ...).
newTokens.add(existing_token);
// Close the marked content with EMC.
newTokens.add(Operator.getOperator("EMC"));

// Add the marked content to the structure tree.
// currentSection is assumed to be a PDStructureElement that is already part of the document's structure tree.
PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.P, currentSection);
structureElement.setPage(page);
PDMarkedContent markedContent = new PDMarkedContent(COSName.P, currentMarkedContentDictionary);
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
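
To make the placement of the BDC/EMC pair relative to the existing tokens easier to see, here is a minimal plain-Java sketch of the wrapping step. It is only a model, not PDFBox code: the tokens are plain strings standing in for the COSName, COSDictionary and Operator objects, and the class name MarkedContentSketch and method tagTextObjects are made up for illustration. It also assumes that every BT ... ET text object should become one paragraph, which a real document will rarely satisfy exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java model of the wrapping step (hypothetical helper, not part of PDFBox).
 * Tokens are strings here; in PDFBox they would be COSName, COSDictionary
 * and Operator objects in the parser's token list.
 */
public class MarkedContentSketch {

    /** Wraps every BT ... ET text object in "/P <</MCID n>> BDC" ... "EMC". */
    public static List<String> tagTextObjects(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int mcid = 0; // MCIDs must be unique within one content stream
        for (String token : tokens) {
            if (token.equals("BT")) {
                // Open the marked-content sequence before the text object starts.
                out.add("/P");
                out.add("<</MCID " + mcid + ">>");
                out.add("BDC");
                mcid++;
            }
            out.add(token);
            if (token.equals("ET")) {
                // Close it after the text object ends, so that the
                // BDC/EMC pair never straddles a BT or ET operator.
                out.add("EMC");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "BT", "/F1", "24", "Tf", "175", "720", "Td", "(Hello World!)", "Tj", "ET");
        // Prints the tagged token sequence on one line.
        System.out.println(String.join(" ", tagTextObjects(tokens)));
    }
}
```

With real PDFBox tokens the same loop would test for `Operator` instances by name instead of comparing strings, and each MCID handed out here would also need a matching PDMarkedContent kid in the structure tree, as in the code above.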

Thanks to @Tilman Hausherr