Printing Chinese characters in pdfbox

Question

I'm using the following set-up:

Java 11.0.1
pdfbox 2.0.15

Objective: Render a PDF that contains Chinese characters

Problem: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding

I already tried:

Using different fonts that support Chinese characters. The latest one is NotoSansCJKtc-Regular.ttf.
Setting the font to Unicode as described in "Java: Write national characters to PDF using PDFBox"; however, the loadTTF method used there is deprecated.
Using Arial-Unicode-MS_4302.ttf.

My code looks like this (shortened a bit):

try (InputStream pdfIn = inputStream;
     PDDocument pdfDocument = PDDocument.load(pdfIn)) {

    // Omitted in the shortened original; presumably the form is obtained like this
    PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();

    PDFont formFont;
    // Check if Chinese characters are present
    if (!Util.containsHanScript(queryString)) {
        formFont = PDType0Font.load(pdfDocument,
            PdfReportGenerator.class.getResourceAsStream("LiberationSans-Regular.ttf"),
            false);
    } else {
        formFont = PDType0Font.load(pdfDocument,
            PdfReportGenerator.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf"),
            false);
    }

    List<PDField> fields = acroForm.getFields();

    // Load fields into a Map keyed by partial field name
    Map<String, PDField> pdfFields = new HashMap<>();
    for (PDField field : fields) {
        String key = field.getPartialName();
        pdfFields.put(key, field);
    }

    PDField currentField = pdfFields.get("someFieldID");
    PDVariableText pdfield = (PDVariableText) currentField;

    // Register the font in the form's default resources and reference it
    // by its resource name in the default appearance string
    PDResources res = acroForm.getDefaultResources();
    String fontName = res.add(formFont).getName();
    String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";

    pdfield.setDefaultAppearance(defaultAppearanceString);
    pdfield.setValue("李柱");

    acroForm.flatten(fields, true);

    ByteArrayOutputStream pdfOut = new ByteArrayOutputStream();
    pdfDocument.save(pdfOut);
}
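
The Util.containsHanScript helper is not shown in the question. A minimal sketch of what such a check might look like, using only the JDK, is given here; the class name HanScriptUtil is hypothetical and the actual implementation in the asker's project may differ.

```java
import java.lang.Character.UnicodeScript;

/** Hypothetical stand-in for the Util class referenced in the question. */
final class HanScriptUtil {

    private HanScriptUtil() {
    }

    /** Returns true if at least one code point of the text belongs to the Han script. */
    static boolean containsHanScript(String text) {
        if (text == null) {
            return false;
        }
        // Iterate over code points rather than chars so that ideographs
        // outside the Basic Multilingual Plane are classified correctly.
        return text.codePoints()
                .anyMatch(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN);
    }

    public static void main(String[] args) {
        // "\u674E\u67F1" is the two-character value used in the question.
        System.out.println(containsHanScript("\u674E\u67F1"));
        System.out.println(containsHanScript("Report 2019"));
    }
}
```

Note that the Han script also covers ideographs used in Japanese and Korean text, so a check like this selects a CJK-capable font rather than detecting the Chinese language as such.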

Expected result: Chinese characters in the PDF.

Actual result: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding

So my question is: what is the best way to render Chinese characters with PDFBox? Any help is appreciated.
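
The exception reports a code point (U+674E) rather than the character itself. As a small aid for reading such messages, the following sketch converts between a string and the U+XXXX notation using only the JDK; the class and method names are hypothetical. It shows that U+674E is the first character of the value passed to setValue, which hints that the value is being encoded with a WinAnsiEncoding font instead of the loaded Type 0 font.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for reading code points in error messages. */
final class CodePointNotation {

    private CodePointNotation() {
    }

    /** Lists every code point of the text in U+XXXX notation. */
    static List<String> toNotation(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    /** Turns a U+XXXX notation back into the character it denotes. */
    static String fromNotation(String notation) {
        // Skip the leading "U+" and parse the rest as hexadecimal.
        int codePoint = Integer.parseInt(notation.substring(2), 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // "\u674E\u67F1" is the field value from the question.
        System.out.println(toNotation("\u674E\u67F1"));
        System.out.println(fromNotation("U+674E"));
    }
}
```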

On second thought, I suspect that the font isn't used, given the mention of WinAnsiEncoding. Could you share the PDF? – Tilman Hausherr, Aug 11 '19 at 17:58
@TilmanHausherr: The Arial Uni font is no longer officially supported, and it is hard to find a download. – Mirko, Aug 11 '19 at 19:40
Check the method described [here](https://www.oipapio.com/question-4651933), in the first answer, which deals with Japanese kanji characters (Japanese kanji come from Chinese characters; most of them differ only in pronunciation). – riccs_0x, Aug 11 '19 at 19:50
I have checked the referenced question. I tried it with loadTTF; however, a) it didn't work and b) the method is now deprecated @riccs_0x – Mirko, Aug 11 '19 at 20:10
@Mirko Perhaps convert the text into images, as in https://stackoverflow.com/questions/29203976/pdfbox-outputs-question-marks-instead-of-some-japanese-characters? – riccs_0x, Aug 11 '19 at 22:34
Mirko - Tilman's answer shows that your code works, at least with the PDF and font at his hands. Thus, please share enough information and data to make your issue reproducible. – mkl, Aug 12 '19 at 11:13

score 3 · Accepted Answer · answered Aug 12 '19 at 06:56

The following code works for me; it uses the file attached to PDFBOX-4629:

PDDocument doc = PDDocument.load(new URL("https://issues.apache.org/jira/secure/attachment/12977270/Report_Template_DE.pdf").openStream());
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDVariableText field = (PDVariableText) acroForm.getField("search_query");
List<PDField> fields = acroForm.getFields();
PDFont font = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arialuni.ttf"), false);

PDResources res = acroForm.getDefaultResources();
String fontName = res.add(font).getName();
String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";

field.setDefaultAppearance(defaultAppearanceString);
field.setValue("李柱");

acroForm.flatten(fields, true);
doc.save("saved.pdf");
doc.close();

Tilman Hausherr


Thanks Tilman. This solution works for me as well, except that I use a different font for the Chinese characters. Just out of interest: do you also get the following message in the console? "INFO: OpenType Layout tables used in font NotoSansCJKTCRegular are not implemented in PDFBox and will be ignored". It is only an INFO message, but it has quite a negative impact on performance. – Mirko Aug 12 '19 at 17:21
That INFO has been removed, although maybe not yet in 2.0.16. You need to fine tune your logging settings. – Tilman Hausherr Aug 12 '19 at 18:02
I see. I think the performance is poor because the Chinese font is 16 MB. – Mirko Aug 12 '19 at 20:35
There may be other fonts that are smaller. My intent was to show that it is possible. Note that you need this font only once in the default resources. IIRC further optimization is possible by saving the file after adding the font to the default resources, and reloading it before setting the values. (There was an issue or an SO question about this, I think) – Tilman Hausherr Aug 13 '19 at 03:47
@TilmanHausherr Hi Tilman, I tried the demo you shared above and it renders Chinese text successfully, but the generated PDF grew by almost 6 MB when loading "c:/windows/fonts/simhei.ttf". Do you have any advice? Many thanks – chris Jan 05 '23 at 16:19
@chris No, this is a known problem because the font can't be subset. – Tilman Hausherr Jan 05 '23 at 16:57
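
Regarding the logging fine-tuning mentioned in the comments above, one possible way to silence the INFO message is sketched here. It rests on two assumptions that are not confirmed in the thread: that no other logging backend (such as Log4j) is on the classpath, so that the Commons Logging facade used by PDFBox ends up in java.util.logging, and that the relevant loggers live under the package names org.apache.pdfbox and org.apache.fontbox. The class name is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper that raises the log threshold for PDFBox and FontBox. */
final class PdfBoxLogLevels {

    // Keep strong references: java.util.logging may garbage-collect loggers
    // that nobody references, which would discard the configured level.
    private static final Logger PDFBOX = Logger.getLogger("org.apache.pdfbox");
    private static final Logger FONTBOX = Logger.getLogger("org.apache.fontbox");

    private PdfBoxLogLevels() {
    }

    /** Drops INFO and lower messages; warnings and errors are still reported. */
    static void suppressInfo() {
        PDFBOX.setLevel(Level.WARNING);
        FONTBOX.setLevel(Level.WARNING);
    }

    public static void main(String[] args) {
        suppressInfo();
        System.out.println(PDFBOX.getLevel());
    }
}
```

Such a method would be called once at application start, before any font is loaded. If a different logging backend is in use, the equivalent setting belongs in that backend's configuration instead.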
