PDFBox delete comment maintain strikethrough

Question

I have a PDF which has a comment on a paragraph. This paragraph is struck through. My requirement is to delete the comment from a specific page.

The following code should delete a specific comment from my PDF but it does not.

PDDocument document = PDDocument.load(...File...);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    List<PDAnnotation> annotationToRemove = new ArrayList<PDAnnotation>();

    if (annotations.size() < 1)
        continue;
    else {
        for (PDAnnotation annotation : annotations) {

            if (annotation.getContents() != null && annotation.getContents().equals("Sample Strikethrough")) {
                annotationToRemove.add(annotation);
            }
        }
        annotations.removeAll(annotationToRemove);
    }
}

What is the best way to remove a specific comment and maintain the strikethrough on the text to which the comment was applied?

Can you share a sample PDF? That having been asked: to *remove the comment but maintain the strikethrough*, one apparently should not remove the annotation (which most likely is a **StrikeOut** annotation) but the **Popup** it references. — mkl, Aug 22 '17 at 08:50
Sure. The file can be downloaded from here: https://expirebox.com/files/3d955e6df4ca5874c38dbf92fc43b5af.pdf . The text that is struck out also has the comment. But deleting the annotation seems to remove the strike from the text, so I guess I am going the wrong way with my approach. I modified the "if" condition in my code sample to identify when to remove the comment annotation. Thank you — Stephan, Aug 22 '17 at 09:00
I am not sure if the link above is working. I am providing another one for the pdf file : https://file.io/DTvqhC — Stephan, Aug 22 '17 at 09:06
The first link works, one merely has to find the correct download link. The second one is better as there is nothing one can do wrong. — mkl, Aug 22 '17 at 09:16
Have you tried your code with your example file? I ran it and it changed nothing! (Which would look like a PDFBox bug...) — mkl, Aug 22 '17 at 10:21
You are totally correct. I was almost sure that I had generated a PDF without the comment. Apparently not... It seems that the above code does not even remove the comment as it should. I am modifying the question — Stephan, Aug 22 '17 at 10:40
Ok, that is ok. I think as a side effect we have identified a bug in PDFBox here, `annotations.removeAll` only works if the annotations to remove are direct objects. In your sample document they are indirect objects but probably you had tested before with a document in which they are direct objects, so that prior test worked as you originally described. — mkl, Aug 22 '17 at 10:45
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152515/discussion-between-stephan-and-mkl). — Stephan, Aug 22 '17 at 10:47

mkl · Accepted Answer · 2017-08-22T13:25:50.800

What is the best way to remove a specific comment and maintain the strikethrough on the text to which the comment was applied?

The annotation you found actually is a text markup annotation of subtype StrikeOut, i.e. the main appearance of this annotation is the strikethrough. Thus, you must not remove this annotation. Instead you should remove the data from which the additional appearance of the annotation, the hover text, is generated.

This can be done like this:

final COSName POPUP = COSName.getPDFName("Popup");

PDDocument document = PDDocument.load(resource);
List<PDAnnotation> annotations = new ArrayList<>();
PDPageTree allPages = document.getDocumentCatalog().getPages();

List<COSObjectable> objectsToRemove = new ArrayList<>();

for (int i = 0; i < allPages.getCount(); i++) {
    PDPage page = allPages.get(i);
    annotations = page.getAnnotations();

    for (PDAnnotation annotation : annotations) {
        if ("StrikeOut".equals(annotation.getSubtype()))
        {
            COSDictionary annotationDict = annotation.getCOSObject();
            COSBase popup = annotationDict.getItem(POPUP);
            annotationDict.removeItem(POPUP);            // popup annotation
            annotationDict.removeItem(COSName.CONTENTS); // plain text comment
            annotationDict.removeItem(COSName.RC);       // rich text comment
            annotationDict.removeItem(COSName.T);        // author

            if (popup != null)
                objectsToRemove.add(popup);
        }
    }

    annotations.removeAll(objectsToRemove);
}

(RemoveStrikeoutComment.java test testRemoveLikeStephanImproved)

As a side effect of looking into this a PDFBox bug became apparent: The original code by the OP should have removed the StrikeOut annotation completely but it did nothing. The reason is a bug in the usage of the COSArrayList class in the context of page annotations.

The page annotation list returned by page.getAnnotations() is an instance of COSArrayList. This class carries both a list of COS objects as they appear in the page Annots array and a list of wrappers for those entries (after resolving indirect references where necessary).

The removeAll method (sensibly) checks its argument collection for such wrappers. From the former list (the COS objects) it removes the actual COS objects underlying those wrappers, not the wrappers themselves; from the latter list (the wrappers) it removes the argument collection as is, i.e. with wrappers.

This works well for direct objects in the Annots array, but entries in the former list which are indirect references aren't properly removed as the code tries to remove the resolved annotation dictionaries while that list actually contains indirect references.

In the case at hand that results in removals not being written back. In more general situations the results can be even weirder, as the two lists now have different sizes. Index-oriented methods, therefore, can now manipulate non-corresponding objects of the lists...
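To make the mechanism easier to follow, here is a minimal sketch of the situation in plain Java. It is a toy model and not PDFBox code: the classes `Dict`, `Ref`, `Wrapper` and `PairedList` and the method `sizesAfterRemoveAll` are hypothetical stand-ins for an annotation dictionary, an indirect reference, an annotation wrapper and the two-list container. The `removeAll` here only mimics the behaviour described above, under the assumption that the raw list is searched for resolved dictionaries.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Toy model only, NOT PDFBox code: every class and method name here is hypothetical.
class TwoListDemo {
    /** Stands in for a resolved annotation dictionary. */
    static final class Dict { }

    /** Stands in for an indirect reference pointing at a Dict. */
    static final class Ref {
        final Dict target;
        Ref(Dict target) { this.target = target; }
    }

    /** Stands in for the wrapper object handed to callers. */
    static final class Wrapper {
        final Dict dict;
        Wrapper(Dict dict) { this.dict = dict; }
    }

    /** Keeps the raw entries and their wrappers side by side. */
    static final class PairedList {
        final List<Object> raw = new ArrayList<>();      // Dict (direct) or Ref (indirect)
        final List<Wrapper> wrappers = new ArrayList<>();

        void add(Object rawEntry) {
            Dict resolved = (rawEntry instanceof Ref) ? ((Ref) rawEntry).target : (Dict) rawEntry;
            raw.add(rawEntry);
            wrappers.add(new Wrapper(resolved));
        }

        /** Mimics the flaw: looks for resolved Dicts in raw, so Ref entries never match. */
        boolean removeAll(Collection<Wrapper> toRemove) {
            List<Dict> resolved = new ArrayList<>();
            for (Wrapper w : toRemove) {
                resolved.add(w.dict);
            }
            raw.removeAll(resolved);
            return wrappers.removeAll(toRemove);
        }
    }

    /** Adds one entry, removes it again, and returns {raw size, wrapper size}. */
    static int[] sizesAfterRemoveAll(boolean indirect) {
        Dict dict = new Dict();
        PairedList list = new PairedList();
        list.add(indirect ? new Ref(dict) : dict);
        list.removeAll(new ArrayList<>(list.wrappers));
        return new int[] { list.raw.size(), list.wrappers.size() };
    }

    public static void main(String[] args) {
        int[] direct = sizesAfterRemoveAll(false);
        int[] viaRef = sizesAfterRemoveAll(true);
        System.out.println("direct:   raw=" + direct[0] + " wrappers=" + direct[1]);
        System.out.println("indirect: raw=" + viaRef[0] + " wrappers=" + viaRef[1]);
    }
}
```

In this model the direct entry disappears from both lists, while the indirect entry disappears only from the wrapper list and stays in the raw list. That corresponds to a removal that looks successful in memory but is not written back.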

(BTW, in my code above I remove an indirect reference, not a wrapper, leaving the lists in disarray, too, as this time only an entry of the former, not the latter list is removed; probably this should also be handled more securely.)

A similar problem occurs in the retainAll method.

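Another glitch: COSArrayList.lastIndexOf uses indexOf of the contained list rather than its lastIndexOf, so for an entry that occurs more than once it reports the first position instead of the last.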
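The effect of such a delegation mistake can be illustrated with a tiny plain-Java sketch. Again this is not PDFBox code: `buggyLastIndexOf` and `correctLastIndexOf` are hypothetical helpers that merely contrast the two `java.util.List` calls on a list with a repeated entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only, NOT PDFBox code: the helper names are hypothetical.
class LastIndexOfDemo {
    /** Mimics a lastIndexOf that wrongly forwards to indexOf of the backing list. */
    static int buggyLastIndexOf(List<String> backing, String value) {
        return backing.indexOf(value);
    }

    /** Forwards to the matching method of the backing list. */
    static int correctLastIndexOf(List<String> backing, String value) {
        return backing.lastIndexOf(value);
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList("x", "y", "x"));
        System.out.println(buggyLastIndexOf(entries, "x"));   // 0
        System.out.println(correctLastIndexOf(entries, "x")); // 2
    }
}
```

The two results differ only when the same entry appears more than once, which is probably why a glitch like this can go unnoticed for a long time.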

The PDFBox source I analysed is the current 3.0.0-SNAPSHOT, but the faulty behaviour also occurs with all versions 2.0.0 to 2.0.7, so their code very likely contains these errors, too.

@TilmanHausherr While the issue at hand can also be solved differently (e.g. by allowing the annotation wrappers to be based in indirect objects, too), `COSArrayList` IMO should really be overhauled: Looking through its code once again I can imagine very many situations that will bring the two contained lists out of synch. — mkl, Aug 23 '17 at 12:21
