
I am having trouble writing Unicode characters to a PDF using PDFBox. The sample code below produces garbage characters instead of "š". What can I add to get support for UTF-8 strings?

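// Imports assume the PDFBox 1.x package layout.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

// Create a document with a single page and open its content stream.
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);

// Helvetica is one of the standard Base 14 fonts.
PDType1Font font = PDType1Font.HELVETICA;
contentStream.setFont(font, 12);
contentStream.beginText();
contentStream.moveTextPositionByAmount(100, 400);
contentStream.drawString("š"); // shows up as garbage in the saved PDF
contentStream.endText();
contentStream.close();
document.save("test.pdf");
document.close();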
Lucas Moellers

1 Answer


You are using one of the built-in 'Base 14' fonts that are supplied with Adobe Reader. These fonts are not Unicode; they are effectively a standard Latin alphabet with a few extra characters. It looks like the character you mention, a lowercase s with a caron (š), is not available in PDF Latin text, though an uppercase Š is available, curiously on Windows only. See Appendix D of the PDF specification at http://www.adobe.com/devnet/pdf/pdf_reference.html for details.
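
As a rough illustration of the "Latin alphabet only" limitation, here is a minimal sketch that flags characters a Latin-only encoding cannot represent. It uses only the JDK, not PDFBox. ISO-8859-1 is an assumption chosen as a convenient stand-in; the encodings actually used by the Base 14 fonts are the ones listed in Appendix D and cover a somewhat different set of characters. The class and method names are hypothetical.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LatinCoverageCheck {

    // Returns the characters of `text` that ISO-8859-1 cannot encode.
    // Assumption: ISO-8859-1 is only a stand-in for a Latin-only font
    // encoding; the real PDF encodings differ in some code points.
    public static List<String> unsupportedChars(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        List<String> missing = new ArrayList<>();
        // Walk the string by code point so that characters outside the
        // Basic Multilingual Plane are treated as a single unit.
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (!encoder.canEncode(s)) {
                missing.add(s);
            }
        });
        return missing;
    }

    public static void main(String[] args) {
        // \u00e9 is "é" and \u0161 is "š"; escapes avoid source-encoding issues.
        System.out.println(unsupportedChars("caf\u00e9 \u0161"));
    }
}
```

For the input above, only "š" (U+0161) ends up in the returned list, since "é" falls inside the Latin-1 range. A check like this might be run on text before drawing it with a non-embedded font, to decide whether an embedded Unicode font is needed.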

To get to the point: you need to embed a Unicode font if you want to use Unicode characters. Make sure you are licensed to embed whatever font you choose. I can recommend the open-source Gentium or Doulos fonts because they are free, high quality, and have comprehensive Unicode support.

gutch
  • Thanks for the info. I tried embedding the Gentium font, but I get the following error message when opening the PDF: "The font 'Gentium' contains bad /Widths". I have tried embedding other TTF files and get the same message. I am replacing the font line with `PDTrueTypeFont font = PDTrueTypeFont.loadTTF(document, "GenR102.TTF");` – Lucas Moellers Mar 25 '11 at 17:37
  • I'm sure I've had that problem too once... I can't remember exactly but I think it might be a problem with incompatible encodings. In other words, PDFBox might think the font is a Latin font instead of a Unicode font. Try setting the encoding with `font.setEncoding(...)` and see http://stackoverflow.com/questions/1713751/using-java-pdfbox-library-to-write-russian-pdf for info on `setEncoding()` – gutch Mar 28 '11 at 03:06
  • I switched to iText. It has been a lot easier to work with for displaying Unicode characters. – Lucas Moellers Mar 29 '11 at 16:20
  • iText is not free – Roman Soviak Apr 13 '18 at 13:08