I'm trying to use PDFBox 2.0 to blank out or delete a text pattern (in my case I want to remove every "[QR]" word from a PDF), but I can't find anything that works for me.

I tried iText too, with the same result: nothing works.

The "[QR]" strings in my PDF were added by editing the PDF after it was created; maybe that's why they don't appear as operands of Tj operators?

My main:

replaceText(documentoPDF, "[QR]", "");

My method (I printed the Tj values and my pattern doesn't appear there):

public void replaceText(PDDocument documentoPDF, String searchString, String replacement) throws IOException{

    for ( PDPage page : documentoPDF.getPages()){
        
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<?> tokens = parser.getTokens();
        
        for (int j = 0; j < tokens.size(); j++){
            
            Object next = tokens.get(j);
            if (next instanceof Operator){
                Operator op = (Operator) next;
                
                String pstring = "";
                int prej = 0;
                
                //Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) 
                {
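                    // Tj takes one operand, the string to display, so update that operand.
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    // replaceFirst treats its first argument as a regex, and "[QR]" would be a character class, so quote it.
                    string = string.replaceFirst(java.util.regex.Pattern.quote(searchString),
                            java.util.regex.Matcher.quoteReplacement(replacement));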
                    previous.setValue(string.getBytes());
                } else 
                if (op.getName().equals("TJ")) 
                {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) 
                    {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) 
                        {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }                       
                    }                        
                    
                    System.out.println(pstring.trim());
                    
                    if (searchString.equals(pstring.trim())) 
                    {                            
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());                           

                        int total = previous.size()-1;    
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }                            
                    }
                }
            }
        }
        
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(documentoPDF);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);            
        out.close();
        page.setContents(updatedStream);
    }

    documentoPDF.save("resources\\resultado\\nuevo.pdf");
}

This is an example of pdf with some [QR] patterns: http://www.mediafire.com/file/9w3kkc4yozwsfms/file

If someone can help, I would appreciate it.

I can upload my entire project if you need it.

Thanks in advance.

André Lemos
Baldur
  • The reason why that doesn't work is simple - you completely ignore the encoding of the font of that text. In the content stream there actually are `[( >) ( 4) ( 5) ( @) ] TJ` instructions (The "spaces" before '>', '4', '5', and '@' actually are zero bytes, 0x00). Thus, apparently the encoding is some 16-bit encoding which also does not have ASCII naturally embedded. – mkl Aug 26 '20 at 08:46
  • So, is it impossible to do what I'm trying to do? I'm new to PDFBox and I don't know how to work with other encodings or convert them :( – Baldur Aug 26 '20 at 20:52
  • It is not impossible. At least usually not; there are PDFs with incomplete or incorrect information for text extraction which makes your task in general impossible. For other PDFs it merely is more complicated than your approach. – mkl Aug 26 '20 at 21:09
  • So, any help with that please? How can I recode that, and be able to take out and compare its plain text and thus replace it? – Baldur Aug 27 '20 at 07:34

1 Answer

As already mentioned in comments, the reason why your code doesn't work is simple - you completely ignore the encoding of the font of that text. In the content stream there actually are `[( >) ( 4) ( 5) ( @) ] TJ` instructions (The "spaces" before '>', '4', '5', and '@' actually are zero bytes, 0x00). Thus, apparently the encoding is some 16-bit encoding which additionally does not have ASCII naturally embedded.
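
To make the encoding issue concrete, here is a minimal, PDFBox-free sketch. It reads the operand bytes quoted above as 2-byte codes and maps them to text. The `TO_UNICODE` table is a hypothetical stand-in for the font's real code-to-Unicode mapping; its four entries are only inferred from the fact that these codes draw "[QR]" and are not read from the actual font.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class TwoByteCodeDemo {
    // Hypothetical stand-in for the font's code-to-Unicode mapping (assumed, not read from the PDF).
    static final Map<Integer, String> TO_UNICODE = new HashMap<>();
    static {
        TO_UNICODE.put(0x003E, "[");
        TO_UNICODE.put(0x0034, "Q");
        TO_UNICODE.put(0x0035, "R");
        TO_UNICODE.put(0x0040, "]");
    }

    // Reads the raw string bytes as big-endian 2-byte character codes and maps each code to text.
    static String decode(byte[] raw) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i + 1 < raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            String unicode = TO_UNICODE.get(code);
            if (unicode != null) {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    // What a byte-per-character reading of the same operand bytes produces.
    static String naive(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The bytes of ( >) ( 4) ( 5) ( @), with 0x00 in place of each "space".
        byte[] raw = {0x00, 0x3E, 0x00, 0x34, 0x00, 0x35, 0x00, 0x40};
        System.out.println("naive reading contains [QR]: " + naive(raw).contains("[QR]"));
        System.out.println("decoded via font mapping: " + decode(raw));
    }
}
```

A byte-per-character reading never contains the search string, which is why a plain string comparison against the operands cannot match.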

To properly take the font into account one has to keep track of the current font. This means parsing the whole content stream and analyzing text font setting calls, save graphics state calls, and restore graphics state calls. Then you have to retrieve the proper font object from the correct resources.
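
The bookkeeping described here can be sketched without PDFBox. The following is a simplified illustration only, not PDFBox's implementation: tokens are plain strings, only the font name is kept as graphics state, and the class and method names are made up for this example. `q` pushes the current font, `Q` restores it, and `Tf` changes it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class FontTrackingDemo {
    static final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "TJ", "'", "\"");

    // Returns the font resource name in effect at each text-showing operator, in stream order.
    static List<String> fontsAtTextOps(List<String> tokens) {
        Deque<String> savedFonts = new ArrayDeque<>();
        String currentFont = "(none)"; // no font selected yet
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals("q")) {
                savedFonts.push(currentFont);      // save graphics state
            } else if (token.equals("Q")) {
                if (!savedFonts.isEmpty()) {
                    currentFont = savedFonts.pop(); // restore graphics state
                }
            } else if (token.equals("Tf") && i >= 2) {
                currentFont = tokens.get(i - 2);    // operands: font name, font size
            } else if (TEXT_SHOWING_OPERATORS.contains(token)) {
                result.add(currentFont);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "/F1", "12", "Tf", "(A)", "Tj",
                "q", "/F2", "9", "Tf", "(B)", "Tj", "Q",
                "(C)", "Tj", "Q");
        System.out.println(fontsAtTextOps(tokens));
    }
}
```

In a real content stream the tracked name still has to be looked up in the resources of the page (or of the form XObject being processed) to get the actual font object, which is the part the existing parsing framework takes care of.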

All this actually is already done by the PDFBox content parsing framework used for e.g. text extraction. Thus, we can create a content stream editor around this framework.

Actually, this also has already been done, see the PdfContentStreamEditor from this answer.

In your document, each text piece to delete is drawn by a single text drawing instruction, and each of these instructions draws nothing but the text to remove. Thus, we can simply look at the text the current instruction draws and then decide whether to keep the instruction or not:

PDDocument document = ...;
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        final StringBuilder recentChars = new StringBuilder();

        @Override
        protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, Vector displacement)
                throws IOException {
            String string = font.toUnicode(code);
            if (string != null)
                recentChars.append(string);

            super.showGlyph(textRenderingMatrix, font, code, displacement);
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
            String recentText = recentChars.toString();
            recentChars.setLength(0);
            String operatorString = operator.getName();

            if (TEXT_SHOWING_OPERATORS.contains(operatorString) && "[QR]".equals(recentText))
            {
                return;
            }

            super.write(contentStreamWriter, operator, operands);
        }

        final List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    };
    editor.processPage(page);
}
document.save("nuevo-noQrText.pdf");

(EditPageContent test testRemoveQrTextNuevo)

Depending on your PDFBox version the showGlyph method to override may have a fifth parameter; thus, please check the showGlyph signature of your PDFBox copy and adapt if this code does not work. Thanks to @DanielNorberg for the hint!

In the result the "[QR]" texts underneath the QR codes have vanished, e.g. [image: source] became [image: result].

mkl
  • I tried the code in this answer together with your PdfContentStreamEditor and it works like a charm, except that I had to add String as the fourth parameter to showGlyph. I am using PDFBox 2.0.19 and showGlyph takes five parameters: PDFStreamEngine.showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement). Maybe that method changed since you wrote the answer? However, after that change I can successfully remove texts from the PDF without seemingly altering the layout of other content. Thanks for sharing your knowledge here, it is very helpful! – Daniel Norberg Aug 16 '21 at 07:58
  • @DanielNorberg *"I am using PDF box 2.0.19 and showGlyph takes five parameters"* - ah, good find; in the 3.0.0-SNAPSHOT development branch one parameter (which was not used anymore in the code) has been removed. – mkl Aug 16 '21 at 09:04