Getting \u0000 while replacing a string in a PDF using PDFBox

Question

I am running into a strange issue. I create a PDF from HTML using wkhtmltopdf, and the resulting PDF looks fine.

But when I try to replace a word in that PDF using PDFBox, I am not able to.

The reason is that I get null characters when reading the string operands of the text operators.

My Code:

protected static void replaceText(String word) throws IOException, COSVisitorException {
        PDPage page = page1; // page1 is a field assigned at class level
        PDStream contents = page.getContents();
        PDFStreamParser parser = new PDFStreamParser(contents.getStream());
        parser.parse();
        List tokens = parser.getTokens();

        for(int i = 0; i < tokens.size(); i++){
            Object next = tokens.get(i);
            if(next instanceof PDFOperator){
                PDFOperator operator = (PDFOperator) next;
                if (operator.getOperation().equals("Tj")) {
                    COSString previous = (COSString) tokens.get(i - 1);
                    String string = previous.getString(); // here I am getting U+0000 (null) characters
                    List<String> listOfStrings = Arrays.asList(string.split(" "));
                    if(listOfStrings.contains(word)) {

                        string = string.replaceFirst(word, ""); 
                        previous.reset();
                        previous.append(string.getBytes(StandardCharsets.ISO_8859_1));
                    }
                }else if (operator.getOperation().equals("TJ")) {

                    COSArray previous = (COSArray) tokens.get(i - 1);
                    for (int k = 0; k < previous.size(); k++) {

                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString(); // same problem here
                            List<String> listOfStrings = Arrays.asList(string.split(" "));
                            if(listOfStrings.contains(word)) {
                                System.out.println(string);
                                string = string.replaceFirst(word, "");
                                cosString.reset();
                                cosString.append(string.getBytes(StandardCharsets.ISO_8859_1));
                            }
                        }

                    }

                }
            }
        }

        PDStream updatedStream = new PDStream(document);
        OutputStream outputStream = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(outputStream);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        document.save(staticFileName);
    }

I am restricted to PDFBox 1.8.6.

I have tested this code on other PDFs (not created by wkhtmltopdf) and it works fine.

mkl · Answer 1 · 2020-04-08T15:08:45.387


It is completely normal that string operands of text drawing operators like Tj contain bytes with value 0.

Your code only works for special pdfs which use fonts with an ASCII'ish encoding (like WinAnsiEncoding) for the text to replace and also meet some other preconditions.

A generic solution to remove specific words from a pdf is somewhere between very complicated and not automatically possible.

The string operands of text drawing operators consist of bytes encoded according to the Encoding entry of the current font.

This encoding may resemble something common, something ASCII'ish like WinAnsiEncoding; but it may also be something completely different. Often ad-hoc encodings are used, e.g. if the text on the page shows "Test text", an encoding mapping 0 to 'T', 1 to 'e', 2 to 's', 3 to 't', 4 to ' ', and 5 to 'x' may be used and the string for drawing that text would consist of the bytes 0, 1, 2, 3, 4, 3, 1, 5, and 3.
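The "Test text" example can be sketched in plain Java, without PDFBox. The class name `AdHocEncodingDemo`, the `ENCODING` map, and the `decode` helper are hypothetical and exist only for this illustration; a real font carries its mapping in its font dictionary. The sketch contrasts a naive ISO-8859-1 read of the operand bytes with a lookup in the font's own mapping.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class AdHocEncodingDemo {
    // Hypothetical ad-hoc font encoding from the example: character code -> glyph.
    static final Map<Integer, Character> ENCODING = new HashMap<>();
    static {
        ENCODING.put(0, 'T');
        ENCODING.put(1, 'e');
        ENCODING.put(2, 's');
        ENCODING.put(3, 't');
        ENCODING.put(4, ' ');
        ENCODING.put(5, 'x');
    }

    // Decode the operand bytes with the font's mapping instead of a fixed charset.
    static String decode(byte[] operand) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            Character c = ENCODING.get(b & 0xFF);
            sb.append(c != null ? c.charValue() : '?'); // '?' marks an unmapped code
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {0, 1, 2, 3, 4, 3, 1, 5, 3};

        // Reading the bytes as ISO-8859-1 yields control characters,
        // starting with U+0000, which is what the question observes.
        String naive = new String(operand, StandardCharsets.ISO_8859_1);
        System.out.println((int) naive.charAt(0));

        // Reading them through the font's mapping yields the visible text.
        System.out.println(decode(operand));
    }
}
```

With such a font, splitting the naively read string on spaces and searching for a word cannot succeed, because the string never contains the visible characters in the first place.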

Thus, in general you need to keep track of the current font and use information from it to decode the string arguments.
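A minimal sketch of that bookkeeping, using a simplified token model rather than the PDFBox classes: the `Op` class, the font names `F1` and `F2`, and the per-font maps are all assumptions made up for this illustration. It follows `Tf` (font selection, operands: font name and size) and `q`/`Q` (save and restore graphics state, which includes the font), and decodes each `Tj` operand with whatever font is current at that point.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FontTrackingDemo {
    /** Stand-in for a parsed operator token (hypothetical, not a PDFBox class). */
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
    }

    static String decode(byte[] operand, Map<Integer, Character> encoding) {
        StringBuilder sb = new StringBuilder();
        for (byte b : operand) {
            Character c = encoding == null ? null : encoding.get(b & 0xFF);
            sb.append(c != null ? c.charValue() : '?'); // '?' marks an undecodable code
        }
        return sb.toString();
    }

    /** Walks the tokens and returns the decoded text of every Tj operation. */
    static List<String> extract(List<Object> tokens, Map<String, Map<Integer, Character>> fonts) {
        List<String> shown = new ArrayList<>();
        Deque<String> saved = new ArrayDeque<>();
        String currentFont = ""; // empty string: no font selected yet
        for (int i = 0; i < tokens.size(); i++) {
            if (!(tokens.get(i) instanceof Op)) continue;
            switch (((Op) tokens.get(i)).name) {
                case "q":  saved.push(currentFont); break;                        // save state
                case "Q":  if (!saved.isEmpty()) currentFont = saved.pop(); break; // restore state
                case "Tf": currentFont = (String) tokens.get(i - 2); break;        // operands: name, size
                case "Tj": shown.add(decode((byte[]) tokens.get(i - 1), fonts.get(currentFont))); break;
                default:   break;
            }
        }
        return shown;
    }

    static Map<String, Map<Integer, Character>> sampleFonts() {
        Map<Integer, Character> f1 = new HashMap<>();
        f1.put(0, 'T'); f1.put(1, 'e'); f1.put(2, 's');
        Map<Integer, Character> f2 = new HashMap<>();
        f2.put(0, 't'); f2.put(1, 'e'); f2.put(2, 'x');
        Map<String, Map<Integer, Character>> fonts = new HashMap<>();
        fonts.put("F1", f1);
        fonts.put("F2", f2);
        return fonts;
    }

    static List<Object> sampleTokens() {
        byte[] bytes = {0, 1, 2, 0}; // the same bytes are drawn three times
        return Arrays.<Object>asList(
            "F1", 12, new Op("Tf"), bytes, new Op("Tj"),
            new Op("q"),
            "F2", 12, new Op("Tf"), bytes, new Op("Tj"),
            new Op("Q"),
            bytes, new Op("Tj"));
    }

    public static void main(String[] args) {
        System.out.println(extract(sampleTokens(), sampleFonts()));
    }
}
```

The same four bytes decode to different text depending on the font in effect, which is why inspecting `Tj` and `TJ` operands in isolation is not enough. A real implementation would also have to deal with multi-byte composite fonts, fonts without usable Unicode information, and words split across several operands, none of which this sketch attempts.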

edited Apr 08 '20 at 15:08

answered Mar 01 '20 at 23:44

mkl


Sorry @mkl, I couldn't get your point. How should I resolve this issue? – Deepak Singh Mar 03 '20 at 04:40
The point is that your code only works for very simple pdfs. You should restrict your use case to such very simple pdfs. A solution working for arbitrary pdfs is somewhere between very complicated and not automatically possible. You may try and extend your code to work with *a few* more pdfs by tracking not only text drawing instructions but also font selection and graphics state saving/restoring ones. Then you can try and interpret the string value according to the encoding of the current font. – mkl Mar 03 '20 at 06:02
See also https://pdfbox.apache.org/2.0/migration.html#why-was-the-replacetext-example-removed – Tilman Hausherr Mar 03 '20 at 11:51
