Why are squares shown instead of symbols in the output file when using PDFBox?

Question

I have spent a week looking for a solution without success. Maybe somebody knows: I am trying to replace a token, e.g. @test, with the number 123456 in a .pdf file using PDFBox.

The token is replaced, but in the output, instead of the digits, I get squares, question marks inside squares, or digits drawn over each other. The only thing I have worked out is that the result depends on the selected font. And I can’t figure out where is the mistake.

Note: we suspected a problem in the .NET port, so we tested with the Java build of version 2.0 and ran into the same issue.

Maybe somebody has faced a similar problem and knows a solution?

Tech details:

Version: PDFBox.NET-1.8.9, taken from http://www.squarepdf.net/pdfbox-in-net
Language: C#
.NET Framework 4.5.2
Fonts used: Times New Roman, Tahoma, Courier, Calibri.

MS Word creation:

Right-click on the desktop
Select Microsoft Word Document from the New menu
Type the text inside the document: @test

Script:

private void ReplaceTextInPdf(string inputPath, string outputPath) {
    PDDocument doc = null;
    try {
        File input = new File(inputPath);
        doc = PDDocument.loadNonSeq(input, null);
        List pages = doc.getDocumentCatalog().getAllPages();

        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage)pages.get(i);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();

            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next is PDFOperator) {
                    PDFOperator op = (PDFOperator)next;
                    // Tj and TJ are the two operators that display
                    // strings in a PDF
                    if (op.getOperation() == "Tj") {
                        // Tj takes one operand, the string to display,
                        // so update that operand
                        COSString previous = (COSString)tokens.get(j - 1);
                        String tempString = previous.getString();

                        tempString = tempString.replace("@test", "123456");

                        previous.reset();
                        previous.append(tempString.getBytes());
                    } else if (op.getOperation() == "TJ") {
                        String tempString = "";
                        COSString cosString = null;
                        COSArray previous = (COSArray)tokens.get(j - 1);
                        for (int k = 0; k < previous.size(); k++) {
                            Object arrElement = previous.getObject(k);
                            if (arrElement is COSString) {
                                cosString = (COSString)arrElement;
                                tempString += cosString.getString();
                                cosString.reset();
                            }
                        }

                        if (tempString != null && tempString.trim().length() > 0) {

                            tempString = tempString.replace("@test", "123456");

                            for (int k = 0; k < previous.size(); k++) {
                                Object arrElement = previous.getObject(k);
                                if (arrElement is COSString) {
                                    cosString.reset();
                                    cosString.append(tempString.getBytes("ISO-8859-1"));
                                    break;
                                }
                            }
                        }
                    }
                }
            }

            // now that the tokens are updated we will replace the
            // page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out1 = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out1);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }

        doc.save(outputPath);
    } finally {
        if (doc != null) {
            doc.close();
        }
    }
}

white squares usually imply the wrong font used to get the symbol you expected — BugFinder, Nov 02 '18 at 07:58
*"And I can’t figure out where is the mistake."* - the mistake is that your replacement code works only for very specific pdfs. This also is why the example from which that code is derived has been removed from pdfbox in the 2.x versions. — mkl, Nov 02 '18 at 08:28
@mkl but if I understood correctly in C# we have 1.8 version officially. — BorHunter, Nov 02 '18 at 08:35
https://pdfbox.apache.org/2.0/migration.html#why-was-the-replacetext-example-removed Yes it was (and still is) in 1.8. But that doesn't invalidate the arguments from the link. — Tilman Hausherr, Nov 02 '18 at 08:37
@TilmanHausher please can you explain in details? I'm not familiar with pdf's and other chars. — BorHunter, Nov 02 '18 at 08:57
*"please can you explain in details?"* - your replacement code works only for very specific pdfs. Most likely your PDF is not one of them; much less likely there is a different issue. Without the PDF file, though, it's hard to go into details beyond the link @Tilman provided. — mkl, Nov 02 '18 at 09:08
@mkl added 2 input files into g drive https://drive.google.com/drive/folders/18cT0tTLWSpPdzubxXH5E8ZGTvN6vY3q-?usp=sharing — BorHunter, Nov 02 '18 at 10:57

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

In general

First of all, the code you use only works under favorable circumstances, i.e. only for PDFs generated in a special way. While PDFs in earlier years were fairly often created that way, nowadays they mostly aren't anymore. This is why the PDFBox example from which that code was derived was removed from the source code base in PDFBox 2.0.

The matching entry in the migration guide explains:

Why was the ReplaceText example removed?

The ReplaceText example has been removed as it gave the incorrect illusion that text can be replaced easily. Words are often split, as seen by this excerpt of a content stream:
[ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ
Other problems will appear with font subsets: for example, if only the glyphs for a, b and c are used, these would be encoded as hex 0, 1 and 2, so you won’t find “abc”. Additionally, you can’t replace “c” with “d” because it isn’t part of the subset.

You could also have problems with ligatures, e.g. “ff”, “fl”, “fi”, “ffi”, “ffl”, which can be represented by a single code in many fonts. To understand this yourself, view any file with PDFDebugger and have a look at the “Contents” entry of a page.

See also PDFBox 2.0 RC3 -- Find and replace text

(Migration to PDFBox 2.0.0)

The problem of words being split for kerning has mostly been circumvented in your code by concatenating the string chunks of the TJ operator's array parameter. The other issues remain, though.
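To illustrate the split-word problem and why concatenation helps, here is a minimal Java sketch that models the TJ array from the migration guide excerpt as a plain list of strings and numbers. It does not use PDFBox; the class and method names are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone model of a TJ operand array; no PDFBox types are used. */
class TjArrayDemo {

    /** Joins the string chunks of a TJ array, skipping the numeric kerning adjustments. */
    static String joinChunks(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    /** True if any single chunk, taken on its own, contains the search text. */
    static boolean anyChunkContains(List<Object> tjArray, String needle) {
        for (Object element : tjArray) {
            if (element instanceof String && ((String) element).contains(needle)) {
                return true;
            }
        }
        return false;
    }

    /** The excerpt from the migration guide: [ (Do) -29 (c) -1 (umen) 30 (tation) ] TJ */
    static List<Object> sample() {
        List<Object> tj = new ArrayList<>();
        tj.add("Do");
        tj.add(-29);
        tj.add("c");
        tj.add(-1);
        tj.add("umen");
        tj.add(30);
        tj.add("tation");
        return tj;
    }

    public static void main(String[] args) {
        List<Object> tj = sample();
        // Searching chunk by chunk misses the word.
        System.out.println(anyChunkContains(tj, "Documentation")); // false
        // Searching the concatenation finds it.
        System.out.println(joinChunks(tj)); // Documentation
    }
}
```

Note that writing the replacement back into a single chunk, as your code does, discards the original kerning between the chunks.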

In case of your example documents

In the case of your example document, the problem is that the replacement "numbers show over each other".

The cause is similar to the "font subsets" problem mentioned in the migration guide. The TTF font program in question is not embedded, though, so it's not a true "font subset" issue. But the font-related information stored in the PDF is only correct for the glyphs actually used in the original PDF, i.e. '@', 'e', 's', and 't', but not for the replacement glyphs, i.e. the digits '1' through '6'.

The glyph-specific information relevant in the case at hand is the glyph width: it is given correctly only for the originally used glyphs; for all other glyphs the given width is 0! The consequence: after drawing one of your replacement glyphs, the position for drawing the next glyph is not shifted appropriately but stays the same (as is appropriate for zero-width glyphs), so the next glyph is drawn starting at the same position, effectively drawing all your replacement glyphs over each other.

(More concretely, the widths array for that font looks like this:

[ 250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 921 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 444 0 0 0 0 0 0 0 0 0 0 0 0 0 389 278]

with '@', 'e', 's', and 't' being encoded using the WinAnsiEncoding and the font consisting of the range from '@' to 't'.)
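To see why zero widths produce the overprinting, here is a minimal Java sketch that models only the width lookup; it does not use PDFBox. It assumes the widths array starts at the space character (code 32), which would account for the leading 250 but is not stated above. The class and method names are made up for this illustration.

```java
/** Stand-alone model of a simple font's widths lookup; no PDFBox types are used. */
class GlyphAdvanceDemo {
    // Assumption: the array starts at the space character (code 32), which would
    // explain the leading 250; the first character code is not stated in the answer.
    static final int FIRST_CHAR = 32;
    static final int LAST_CHAR = 't'; // code 116

    /** Rebuilds the widths array: zero everywhere except the glyphs actually used. */
    static int[] widths() {
        int[] w = new int[LAST_CHAR - FIRST_CHAR + 1];
        w[' ' - FIRST_CHAR] = 250;
        w['@' - FIRST_CHAR] = 921;
        w['e' - FIRST_CHAR] = 444;
        w['s' - FIRST_CHAR] = 389;
        w['t' - FIRST_CHAR] = 278;
        return w;
    }

    /**
     * Sums the glyph widths (in thousandths of the font size) for the given text.
     * Codes outside the array count as 0 in this sketch.
     */
    static int advance(String text, int[] widths) {
        int total = 0;
        for (char c : text.toCharArray()) {
            int index = c - FIRST_CHAR;
            if (index >= 0 && index < widths.length) {
                total += widths[index];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] w = widths();
        System.out.println(advance("@test", w));  // 2310
        System.out.println(advance("123456", w)); // 0
    }
}
```

With these widths, "@test" moves the text position forward by 2310 units in total, while "123456" moves it by 0, so all six digits are drawn at the same spot.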

In this special case you can probably fix the issue by printing, somewhere invisibly (e.g. white on white) in your Word template, a string containing every character of that font that you may want to use in replacements for your placeholder.

In general, though, the encoding need not be something ASCII'ish like WinAnsiEncoding but may instead be completely different, possibly even made up for the occasion, e.g. #1 for the first glyph used on the page, #2 for the second, different glyph on that page, etc. Thus, in general a work-around is not so easy to find.
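As a rough illustration of such a made-up encoding, the following Java sketch assigns codes in order of first use and then tries to encode a replacement string. It is a simplified model of what a PDF producer might do, not PDFBox code, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone model of an ad hoc font encoding; no PDFBox types are used. */
class AdHocEncodingDemo {

    /** Assigns codes 1, 2, 3, ... to characters in order of first use. */
    static Map<Character, Integer> buildEncoding(String pageText) {
        Map<Character, Integer> codes = new LinkedHashMap<>();
        for (char c : pageText.toCharArray()) {
            if (!codes.containsKey(c)) {
                codes.put(c, codes.size() + 1);
            }
        }
        return codes;
    }

    /** Encodes text with the given codes; returns null if a character has no code. */
    static byte[] encode(String text, Map<Character, Integer> codes) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = codes.get(text.charAt(i));
            if (code == null) {
                return null; // this glyph was never used, so the font has no code for it
            }
            out[i] = code.byteValue();
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Character, Integer> codes = buildEncoding("@test");
        // The bytes in the content stream do not spell "@test" at all.
        System.out.println(Arrays.toString(encode("@test", codes))); // [1, 2, 3, 4, 2]
        // The replacement cannot be expressed in this encoding.
        System.out.println(encode("123456", codes) == null); // true
    }
}
```

In such a document a plain string search for "@test" finds nothing, and there is no byte sequence you could write for "123456" without extending the font's encoding.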

Are there any other libraries/ways to find/replace text in an existing PDF? — AElMehdi, Apr 28 '20 at 16:41
@AElMehdi I explained an approach using PDFBox in [this answer](https://stackoverflow.com/a/61411728/1729265) in the section "An approach". A generic implementation is far beyond the scope of a stackoverflow question but a fairly decent attempt should require not more than a few weeks or months of developing time. — mkl, Apr 28 '20 at 17:26
