I am working on removing watermarks from PDF files, and I ran into a problem: how do I remove a specific sentence from a PDF file? My idea is to record the order of the text-showing operators (TJ, Tj, ') while processing them, giving each one an index (showIdx). When the sentence to be removed is found, I look up the order indices of the operators that drew it, reprocess the content stream, and delete them. The post at https://stackoverflow.com/questions/58475104/filter-out-all-text-above-a-certain-font-size-from-pdf introduces PdfContentStreamEditor, but I could not work out how to apply it to my case.

BT
Tj   showIdx2
TJ   showIdx2
...
ET

BT
Tj    showIdx3
TJ    showIdx4
...
ET
The sample PDF file: https://github.com/zhongguogu/PDFBOX/blob/master/pdf/watermark.pdf
The text to remove is in the page header: "本报告仅供-中庚基金管理有限公司-中庚报告邮箱使用 p2"
Dagu
  • Please share your PDF, because there are many ways to have a watermark. See also the RemoveAllText.java example in the source code download. – Tilman Hausherr Mar 26 '21 at 03:22
  • Do i understand you correctly? I understood that you have successfully removed watermarks from pdfs but now you want to remove arbitrary sentences from the content. If that's correct, what is the exact problem in doing so now? – mkl Mar 26 '21 at 05:43
  • @mkl Yes, I can remove the watermark sentence in one such PDF file, but it failed in another PDF file. The method I mentioned is not very good; sometimes it removes the wrong thing. I wonder whether there is a good way to remove arbitrary sentences from the content. – Dagu Mar 26 '21 at 05:55
  • @TilmanHausherr Thanks, but the class RemoveAllText removes all TJ, Tj and ' operators. I need to remove only the operators that match a given sentence. – Dagu Mar 26 '21 at 05:57
  • You will obviously have to add some logic to identify your text. Look at your file with PDFDebugger. Btw if this is about removing watermarks from e-books - don't. – Tilman Hausherr Mar 26 '21 at 06:48
  • Indeed, if google translate doesn't betray me, that line says that "this report is only for-Zhong Geng Fund Management Co., Ltd.-Zhong Geng Report Mailbox". This quite likely means that the report indeed is for Zhong Geng eyes only. But let's assume they decided to publish those reports more widely and you have the task of removing that soft restriction. In that case how far did you get in removing that text? The `PdfContentStreamEditor` framework you mention should be usable for that task; what stopped you? – mkl Mar 26 '21 at 09:10

1 Answer

According to Google translate that line says that "this report is only for-Zhong Geng Fund Management Co., Ltd.-Zhong Geng Report Mailbox". This quite likely means that the report indeed was for Zhong Geng eyes only. But let's assume they decided to publish those reports more widely and you have the task of removing that soft restriction.

You mentioned the PdfContentStreamEditor from this answer.

Indeed you can use it similar to how it has been used in this answer where a string "[QR]" was to be removed from underneath some QR codes:

PDDocument document = ...
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        // Text drawn since the last instruction was written
        final StringBuilder recentChars = new StringBuilder();

        @Override
        protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
                throws IOException {
            // Collect the Unicode text of each glyph the current instruction shows
            String string = font.toUnicode(code);
            if (string != null)
                recentChars.append(string);

            super.showGlyph(textRenderingMatrix, font, code, displacement);
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            // Take the text drawn by this instruction and reset the collector
            String recentText = recentChars.toString();
            recentChars.setLength(0);
            String operatorString = operator.getName();

            // Drop a text showing instruction that draws exactly the text to remove
            if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "本报告仅供-中庚基金管理有限公司-中庚报告邮箱使用 p2".equals(recentText))
            {
                return;
            }

            // Copy every other instruction unchanged
            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    editor.processPage(page);
}
document.save("watermark-RemoveByText.pdf");

(RemoveText test testRemoveByText)

Beware, though, this only works if the text to remove is drawn using one text showing instruction only and that instruction only draws the text to remove.

If instead the text to replace is drawn using multiple instructions following each other, you have to start collecting instructions as long as you have a potential match instead of dropping them immediately. As soon as the potential match turns out not to be a match after all, you'll have to super.write the collected instructions.
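
The buffering idea from the previous paragraph could look roughly like the following minimal sketch. It is deliberately independent of PDFBox: `BufferedTextRemover`, `accept` and `finish` are hypothetical names, each instruction is represented only by a label, and `output.add(...)` stands in for the call to super.write. It also assumes the target text is non-empty and begins and ends exactly at instruction boundaries. In a real editor, the non-text instructions that arrive between buffered text showing instructions would have to be buffered as well.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper, not part of PDFBox or PdfContentStreamEditor.
 * Buffers text showing instructions while their combined text can still
 * turn out to be the text to remove.
 */
class BufferedTextRemover {
    private final String target;
    // Each entry is {instruction label, text drawn by that instruction}
    private final List<String[]> pending = new ArrayList<>();
    // Stand-in for super.write: instructions that are kept
    private final List<String> output = new ArrayList<>();

    BufferedTextRemover(String target) {
        this.target = target;
    }

    /** Feed one text showing instruction together with the text it draws. */
    void accept(String instruction, String text) {
        pending.add(new String[] {instruction, text});
        while (!pending.isEmpty()) {
            String joined = joinedText();
            if (joined.equals(target)) {
                pending.clear();          // full match: drop all buffered instructions
                return;
            }
            if (target.startsWith(joined)) {
                return;                   // still a potential match: keep buffering
            }
            // Not a match: release the oldest instruction and retry with the rest
            output.add(pending.remove(0)[0]);
        }
    }

    /** Call at the end of the stream: releases whatever is still buffered. */
    List<String> finish() {
        for (String[] entry : pending) {
            output.add(entry[0]);
        }
        pending.clear();
        return output;
    }

    private String joinedText() {
        StringBuilder sb = new StringBuilder();
        for (String[] entry : pending) {
            sb.append(entry[1]);
        }
        return sb.toString();
    }
}
```

Releasing only the oldest buffered instruction on a mismatch, rather than the whole buffer, keeps a match that starts in the middle of the buffer from being missed. For example, with the target "abc" and instructions drawing "a", "a", "bc", only the first "a" is kept.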

And if instead the text to replace is only part of what a single instruction draws, you'll have to modify that instruction itself. Depending on the script this may be very difficult, in particular if it makes heavy use of ligatures and similar features.
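
To illustrate why ligatures complicate this, here is a small sketch that only locates the character codes to cut out of one instruction. It does not rewrite any PDF operands. `GlyphSpanFinder` and `findSpan` are hypothetical names, and the input list is assumed to hold, for each character code of the instruction, the string one would get from font.toUnicode(code).

```java
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
class GlyphSpanFinder {
    /**
     * Returns {start, endExclusive} indices of the glyphs whose combined text
     * equals the target, or null if the target does not start and end on
     * glyph boundaries within this instruction.
     */
    static int[] findSpan(List<String> glyphTexts, String target) {
        for (int start = 0; start < glyphTexts.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < glyphTexts.size(); end++) {
                sb.append(glyphTexts.get(end));
                if (!target.startsWith(sb.toString())) {
                    break;                              // diverged from the target
                }
                if (sb.length() == target.length()) {
                    return new int[] {start, end + 1};  // exact match on glyph boundaries
                }
            }
        }
        return null;
    }
}
```

With glyph texts "H", "e", "llo", " ", "fi", "x", the target "llo fi" is found as the span from index 2 to 5, while the target "f" is not found at all because the single ligature glyph draws "fi". In that second case the glyph cannot simply be dropped from the instruction without also losing the "i".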

And the most complex situations may require you to collect all instructions while they're coming in, analyzing the whole of them, adapting identified instructions, and then forwarding the manipulated collected instructions to super.write.

mkl
  • Thanks very much for your immediate help. – Dagu Mar 26 '21 at 15:55
  • What about when the whole page content is in one big form XObject? The page content stream looks like `q q BT 50 791.92 Td ET Q q 1 0 0 1 0 0 cm /Xf2 Do Q Q q q 0.97275 0 0 0.97275 0 35.75 cm Q Q`, and all the content is in Xf2. It is difficult to remove a sentence inside form Xf2. – Dagu Mar 29 '21 at 08:14
  • Correct. Essentially you also have to apply the `PdfContentStreamEditor` not only to the page in question but also to the XObjects and Patterns of the page (and that recursively). You can simply add a method similar to `processPage` which processes a form XObject and creates a replacement for it. – mkl Mar 29 '21 at 10:26
  • Thanks very much. Should I create a replacement for the whole form XObject, or only for the form XObject's content stream? I find it hard to update a form XObject's content stream. – Dagu Mar 30 '21 at 06:38
  • Indeed, I would create a *new* form XObject and *replace the reference* to the old one in the page resources by a reference to the new one. – mkl Mar 30 '21 at 06:40
  • Creating a new form XObject and replacing the reference works well. Thanks a lot. – Dagu Mar 31 '21 at 01:14