I'm using PDFBox 2.0.1.

I am trying to dynamically add user-provided UTF-8 text to the form fields and show the result to the user. Unfortunately, either the PDF library cannot properly encode special characters such as "äöü", or I was not able to find any useful documentation that could help me with this issue.

Can someone tell me what is wrong with the given code sample?

try (PDDocument document = PDDocument.load(pdfTemplate)) {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm form = catalog.getAcroForm();

    List<PDField> fields = form.getFields();
    for (PDField field : fields) {
        switch (field.getPartialName()) {
            case "devices":
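                // Frontend (JS): userInput = btoa('Gerät')
                String base64devices = ...
                String name = new String(Base64.getDecoder().decode(base64devices), "UTF-8");
                // Setting the value is what triggers the appearance generation seen in the stack trace
                field.setValue(name);
                field.setReadOnly(true);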
                break;
        }
    }
    form.flatten(fields, true);
    document.save(bos);
}

And here the stacktrace of the error:

java.lang.IllegalArgumentException: U+FFFD is not available in this font's encoding: WinAnsiEncoding
    org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.encode(PDTrueTypeFont.java:368)
    org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:286)
    org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:315)
    org.apache.pdfbox.pdmodel.interactive.form.PlainText$Paragraph.getLines(PlainText.java:169)
    org.apache.pdfbox.pdmodel.interactive.form.PlainTextFormatter.format(PlainTextFormatter.java:182)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.insertGeneratedAppearance(AppearanceGeneratorHelper.java:373)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceContent(AppearanceGeneratorHelper.java:237)
    org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(AppearanceGeneratorHelper.java:144)
    org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:263)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:324)
    org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.flatten(PDAcroForm.java:213)
    my.application.service.PDFService.generatePDF(PDFService.java:201)

I also found those (related) issues on SO:

pdfbox: ... is not available in this font's encoding: That does not help me choose the right encoding, or tell me how to set it. IIRC, Java uses UTF-16 internally for character encoding, so why is the default not enough? Is this an issue with the PDF document itself or with the code I use to fill it?


PdfBox encode symbol currency euro: Well, it is dynamic user input, so there are way too many things I would have to replace myself.

Thus, if the PDFBox people decided to fix the broken PDFBox method, this seemingly clean work-around code here would start to fail as it would then feed the fixed method broken input data.

Admittedly, I doubt they will fix this bug before 2.0.0 (and in 2.0.0 the fixed method has a different name), but one never knows...

Unfortunately, I was not able to find this other setter method, though it may apply to a different scope.

EDIT

Updated example code to better represent the problem.

ST-DDT
  • There exist both `PDFTrueTypeFont.setFontEncoding(new WinAnsiEncoding());` and `~.setEncoding(COSName.WIN_ANSI_ENCODING);` – Joop Eggen Sep 27 '16 at 09:14
  • @JoopEggen Sorry I don't know what you are trying to tell me here. Do I need to set one or both? Or should I set it to something else? – ST-DDT Sep 27 '16 at 09:26
  • I have seen both being set. I doubt it helps, as the bug reports do not seem to mention setting both as a viable solution. But who knows? – Joop Eggen Sep 27 '16 at 09:28
  • Please retry with 2.0.3. If it still doesn't work, then please open an issue in JIRA and attach your PDF and your code. I wonder where the U+FFFD comes from. "ä" is supported in WinAnsiEncoding. – Tilman Hausherr Sep 27 '16 at 09:36
  • @ST-DDT I just tried to reproduce the issue but I couldn't. Thus either there is something very peculiar in your PDF, or your Java editor and your Java compiler have different assumptions concerning the encoding of your code. To investigate the former option, please share the PDF in question. To check the latter one, replace the 'ä' with a '\u00E4' as proposed in Simone's answer. – mkl Sep 27 '16 at 12:09
  • @mkl I updated example code to better represent the problem. The value "Gerät" is provided by the user via a base64 encoded parameter. – ST-DDT Sep 27 '16 at 12:58
  • JavaScript uses UTF-16 encoded strings, so you might need to use: `new String(Base64.getDecoder().decode(base64devices), "UTF-16");` – Simone Rondelli Sep 27 '16 at 13:00
  • You might be interested in this answer as well http://stackoverflow.com/questions/30106476/using-javascripts-atob-to-decode-base64-doesnt-properly-decode-utf-8-strings – Simone Rondelli Sep 27 '16 at 13:04
  • @ST-DDT As you have accepted Simone's answer, I assume your issue is resolved. – mkl Sep 27 '16 at 13:37
  • @SimoneRondelli That link gave me the solution, please add it to your answer for completeness. – ST-DDT Sep 27 '16 at 13:52
  • refer to this similar post here in the blog https://stackoverflow.com/a/76535960/6095444 – Pravin Bansal Jun 22 '23 at 22:12

1 Answer

U+FFFD is used to replace an incoming character whose value is unknown or unrepresentable in Unicode (compare the use of U+001A as a control character to indicate the substitute function) (source).

That said, it is likely that the character gets corrupted somewhere along the way. Maybe the encoding of the file is not UTF-8, and that is why the character is corrupted.

As a general rule, you should only write ASCII characters in source code. You can still represent the whole Unicode range using the escaped form \uXXXX. In this case, ä becomes \u00E4.
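
As a small illustration of the escaped form, the following sketch builds the example word without any non-ASCII character in the source file. The class and method names are made up for this example.

```java
class UnicodeEscapeDemo {
    // Returns the example word; the escape keeps the source file pure ASCII,
    // so the result does not depend on the encoding the compiler assumes.
    static String deviceLabel() {
        return "Ger\u00E4t";
    }

    public static void main(String[] args) {
        String label = deviceLabel();
        System.out.println(label.length());                        // prints 5
        System.out.println(Integer.toHexString(label.charAt(3)));  // prints e4
    }
}
```

This only rules out a mismatch between editor and compiler encoding. It does not help when the text arrives at runtime, as user input does.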

-- UPDATE --

Apparently the problem is in how the user input gets encoded and decoded between client and server using the JS function btoa. A solution to this problem can be found at this link:

Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings
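
To see why the server ends up with U+FFFD, here is a minimal sketch of the mismatch, assuming the frontend calls btoa directly on the string. btoa emits one byte per character, which for 'Gerät' matches ISO-8859-1, so the lone byte 0xE4 is malformed when the server decodes it as UTF-8. The class and method names are hypothetical, and `simulateBtoa` only approximates btoa for characters up to U+00FF.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BtoaDecodeDemo {
    // Approximates the JS call btoa(text): one byte per character (ISO-8859-1).
    // Assumption: text contains only characters up to U+00FF.
    static String simulateBtoa(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Base64.getEncoder().encodeToString(latin1);
    }

    // Decodes the Base64 payload and interprets the bytes in the given charset.
    static String decodeAs(String base64, Charset charset) {
        return new String(Base64.getDecoder().decode(base64), charset);
    }

    public static void main(String[] args) {
        String base64devices = simulateBtoa("Ger\u00E4t");
        System.out.println(base64devices); // prints R2Vy5HQ=

        // Decoding as UTF-8 turns the lone 0xE4 byte into U+FFFD.
        String wrong = decodeAs(base64devices, StandardCharsets.UTF_8);
        System.out.println(wrong.indexOf('\uFFFD')); // prints 3

        // Decoding with the charset the bytes were written in restores the text.
        String right = decodeAs(base64devices, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals("Ger\u00E4t")); // prints true
    }
}
```

Decoding as ISO-8859-1 on the server only works for input within that range. Encoding the string as UTF-8 on the frontend before calling btoa, which appears to be the approach the linked question discusses, is the more general fix.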

Simone Rondelli
  • Sorry for not being precise enough in my example. The value "Gerät" is provided by the user via Base64 encoding. See updated example. – ST-DDT Sep 27 '16 at 12:59
  • See my comments above; the problem is probably in the encoding/decoding of the string. Have you tried to print out the decoded value of the string on the console? – Simone Rondelli Sep 27 '16 at 13:08
  • The input from the JS frontend is for some reason encoded in ISO-8859-1. I will try to fix it there. Thanks for the support. Any idea how I can strip that/any bad character from my string? `Pattern.compile(new String({0xFF, 0xFD},"UTF-8").replaceAll(input)`? – ST-DDT Sep 27 '16 at 13:38
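
On the last question about stripping the bad character: U+FFFD is a single Java char, so it can be written as the escape \uFFFD and removed with a plain string replacement, with no byte array or regex needed. A minimal sketch, with a hypothetical helper name:

```java
class ReplacementCharStripper {
    // Removes every U+FFFD (REPLACEMENT CHARACTER) from the input.
    // This hides the symptom only; the original characters are already lost.
    static String strip(String input) {
        return input.replace("\uFFFD", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("Ger\uFFFDt")); // prints Gert
    }
}
```

Because the original character cannot be recovered this way, fixing the encoding on the frontend remains the better option.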