Apache PDFBox replace text results in few character missed

Question

I am using Apache PDFBox 2.0.2 to replace text with the code below. In the output, some characters are not displayed, mostly capital letters. For example, the replacement "ABCDEFGHIJKLMNOPQRSTUVWXYZ" appears in the PDF as "ABCDEF HIJKLM OP RST W Y ". Is this a bug, or is there a workaround to handle these characters?

public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                //Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so let's update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
        out.close();
    }
    return document;
}

Your question parallels [this question](https://stackoverflow.com/q/34239106/1729265) with the small difference that the PDF library used there was iText. Much of [my answer to it](https://stackoverflow.com/a/34315962/1729265) applies here, too. – mkl Aug 04 '17 at 09:52
Thanks @mkl, that's a nice and elaborate answer. I'm trying to work through your suggestion, but would still like to go with a PDFBox solution if possible. – prasanth pai Aug 04 '17 at 10:42
I didn't want to make you replace PDFBox by iText. What I meant by "Much of my answer to it applies here, too." were the explanations why the issue occurs, and these explanations are library independent, they are based on how PDF works in general. – mkl Aug 04 '17 at 10:48
Thanks @mkl, it was definitely helpful for understanding this issue. Thanks for sharing. – prasanth pai Aug 04 '17 at 19:58

Tilman Hausherr · Accepted Answer · 2017-08-04T09:22:40.600

Quoting from https://pdfbox.apache.org/2.0/migration.html

Why was the ReplaceText example removed?

The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:

[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ

Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.

You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.
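As an illustration of the split-word problem described in the quote (this sketch is not taken from the migration guide), the excerpt can be examined in plain Java without any PDF library. The class `TjSplitDemo` and its methods are hypothetical names, and the parsing assumes the strings contain no escaped or nested parentheses. It suggests why a per-string replace, as in the question's code, never sees the word "Documentation": no single operand contains it, only the concatenation does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Library-free model of the TJ excerpt; this is NOT PDFBox API.
class TjSplitDemo {

    // Extracts the (...) string operands of a TJ array, ignoring the kerning numbers.
    // Simplification: assumes no escaped or nested parentheses inside the strings.
    static List<String> stringOperands(String tjArray) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^()]*)\\)").matcher(tjArray);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    // True if any single operand contains the search string, which is all a
    // per-string replace can ever match.
    static boolean foundInSingleOperand(List<String> parts, String search) {
        for (String part : parts) {
            if (part.contains(search)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> parts = stringOperands("[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ");
        System.out.println(parts);
        System.out.println("in one operand: " + foundInSingleOperand(parts, "Documentation"));
        System.out.println("in joined text: " + String.join("", parts).contains("Documentation"));
    }
}
```

Even with the fragments joined, writing a replacement back would still require deciding how to redistribute the text and kerning values across the array, which this sketch does not attempt.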

======================================================================

Your description suggests that the original file uses a font subset that is missing some glyphs, such as G, N, Q and V.
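The effect of a font subset can be imitated in plain Java. This is only a rough model under stated assumptions, not PDFBox API: the class `SubsetFontDemo`, its methods, and the sample text "HELLO PDF" are hypothetical, and a character without a glyph is modelled as a blank, whereas a real viewer may show nothing or a placeholder box.

```java
import java.util.Set;
import java.util.TreeSet;

// Library-free model of a subset font; this is NOT PDFBox API.
class SubsetFontDemo {

    // Collects the distinct characters of the original text. A subset font
    // typically embeds glyphs only for characters the document actually used.
    static Set<Character> subsetFor(String originalText) {
        Set<Character> glyphs = new TreeSet<>();
        for (char c : originalText.toCharArray()) {
            glyphs.add(c);
        }
        return glyphs;
    }

    // Models what is displayed: characters without a glyph in the subset
    // are replaced by a blank (an assumption made for illustration).
    static String render(String text, Set<Character> glyphsInSubset) {
        StringBuilder shown = new StringBuilder();
        for (char c : text.toCharArray()) {
            shown.append(glyphsInSubset.contains(c) ? c : ' ');
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        // Hypothetical original document text, chosen only for illustration.
        Set<Character> subset = subsetFor("HELLO PDF");
        // A, B, C and G have no glyph in this subset, so they drop out.
        System.out.println("[" + render("ABCDEFGH", subset) + "]");
    }
}
```

The pattern of gaps resembles the output in the question: whichever letters the original document never used are the ones that vanish from the replacement.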

And no, there is no easy workaround. You would have to delete the text you don't want from the content stream, and then append a new content stream with the text you want with a new font at the correct place.

P.S. the current PDFBox version is 2.0.7, not 2.0.2.

edited Aug 04 '17 at 09:22

answered Aug 04 '17 at 08:50


Thanks @Tilman, I tried the latest version, 2.0.7, but the issue still persists. As suggested, could you please help me with code samples to delete a particular text from the content stream and to append text in a new content stream? That would be a great help. – prasanth pai Aug 04 '17 at 10:37
I didn't claim it would work in 2.0.7. I only mentioned 2.0.7 because it's bad practice to use outdated software. To delete a particular text, have a look at the RemoveAllText.java example; you need to modify it so that it works for your file (the text is in `newTokens.get(newTokens.size() - 1)`, so check that it matches, which can be tricky due to encoding). To append text in a new content stream, see the AddMessageToEachPage.java example in the source code download. – Tilman Hausherr Aug 04 '17 at 11:06
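The check-and-drop idea from the comment above can be modelled without PDFBox. The following is a minimal sketch, not the RemoveAllText.java example itself: `RemoveTextDemo` and `Op` are hypothetical stand-ins, plain Java strings take the place of string operands, and only the single-string `Tj` case is handled. In a real PDF the operand bytes are font codes, so the equality check would need decoding first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Library-free model of a token list; this is NOT PDFBox API.
class RemoveTextDemo {

    // Hypothetical stand-in for a content stream operator such as BT, Tj or ET.
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    // Copies the tokens, dropping each "Tj" operator together with its
    // operand when that operand equals the text to delete.
    static List<Object> removeText(List<Object> tokens, String textToDelete) {
        List<Object> newTokens = new ArrayList<>();
        for (Object token : tokens) {
            boolean isTj = token instanceof Op && ((Op) token).name.equals("Tj");
            if (isTj && !newTokens.isEmpty()) {
                // The operand was copied just before its operator.
                Object operand = newTokens.get(newTokens.size() - 1);
                if (textToDelete.equals(operand)) {
                    newTokens.remove(newTokens.size() - 1);
                    continue; // skip the operator as well
                }
            }
            newTokens.add(token);
        }
        return newTokens;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.asList(
                new Op("BT"), "Hello", new Op("Tj"), "World", new Op("Tj"), new Op("ET"));
        System.out.println(removeText(tokens, "World"));
    }
}
```

Building a new list rather than mutating the original keeps the loop simple, since the operand to inspect is always the last element copied so far.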
Thanks for the code example. The delete-token approach is workable, but for writing, my case requires inserting text at a particular location (at a given x,y coordinate). Reading from a rectangular area worked well with an x,y coordinate plus width and height, so I would like to know if something similar is possible for writing as well. – prasanth pai Aug 04 '17 at 12:56
Have a look at the HelloWorldTTF.java example. This writes a text at (100,700). Note that (0,0) is left-bottom, not left-top. 1 unit = 1/72 inch. – Tilman Hausherr Aug 04 '17 at 12:59
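The unit and origin rules in this comment come down to a little arithmetic, sketched here in plain Java. The class `PdfCoordinates` and its helper names are hypothetical, and the sketch assumes an unrotated page whose lower-left corner is at (0,0) with the default unit of 1/72 inch. The US Letter height of 792 units (11 inches) is used as the example page.

```java
// Unit and origin conversion for PDF page coordinates; plain arithmetic, no PDFBox needed.
class PdfCoordinates {

    static final float POINTS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    // 1 unit = 1/72 inch, so inches are multiplied by 72.
    static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // Converts a distance measured down from the top edge into a PDF
    // y coordinate, which is measured up from the bottom edge.
    static float yFromTop(float pageHeightPoints, float distanceFromTopPoints) {
        return pageHeightPoints - distanceFromTopPoints;
    }

    public static void main(String[] args) {
        float letterHeight = inchesToPoints(11f); // 792 units for US Letter
        // A position 92 units below the top edge corresponds to y = 700.
        System.out.println(yFromTop(letterHeight, 92f));
    }
}
```

Coordinates taken from a tool that measures from the top-left corner would need this flip before being passed to a text-positioning call.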
Great, thanks @Tilman, I was able to get this working and am still refining it. I am wondering whether text positioned at a fixed x,y coordinate will be consistent across different systems (the PDF file structure itself would remain the same). – prasanth pai Aug 04 '17 at 16:58
Yes, that's the main feature of PDF since it began in the 1990s. – Tilman Hausherr Aug 04 '17 at 17:06
