I'm using PDFBox 2.0.0-SNAPSHOT to build a PDF in Java. It works fine for very basic characters (e.g. [a-zA-Z0-9]), but I'm getting encoding errors for slightly more advanced characters such as ’ (quoteright). Here's my code:

PDDocument pdf = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
pdf.addPage(page);

PDPageContentStream contents = new PDPageContentStream(pdf, page);
PDFont font = PDType1Font.HELVETICA;
contents.beginText();
contents.setFont(font, 12);

// ...

String text = "’";
contents.showText(text);

contents.endText();
contents.close();

I get this exception:

Can't encode U+2019 in font Helvetica. Type 1 fonts only support 8-bit code points

I looked up the supported characters for non-embedded fonts in Section D.1 of the PDF specification, and this character should be supported.

Indeed, if I use this trick, I can insert the correct character:

// ...

// String text = "’";
// contents.showText(text);
byte[] commands = "(x) Tj ".getBytes();
commands[1] = (byte)146;    // = 222 octal = quoteright in WinAnsi (145 = 221 octal is quoteleft)
contents.appendRawCommands(commands);

// ...

But this isn't really a practical solution. Aside from the inconvenience of manually searching for every character that might be in the string, the appendRawCommands method is now deprecated.
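
To illustrate what automating that manual lookup might look like, here is a sketch that uses only the JDK. It assumes that Java's windows-1252 charset is a close enough stand-in for PDF's WinAnsiEncoding (the two differ in a few codes), and the names WinAnsiLiteral and toPdfLiteral are made up for this example; they are not part of PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiLiteral {
    // Assumption: windows-1252 approximates WinAnsiEncoding for this purpose.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Builds a PDF literal string, e.g. "(It\222s)", from the given text. */
    static String toPdfLiteral(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        if (!encoder.canEncode(text)) {
            throw new IllegalArgumentException("Not representable in windows-1252: " + text);
        }
        StringBuilder out = new StringBuilder("(");
        for (byte b : text.getBytes(CP1252)) {
            int code = b & 0xFF;
            if (code == '(' || code == ')' || code == '\\') {
                // Delimiters and the backslash must be escaped in a literal string.
                out.append('\\').append((char) code);
            } else if (code < 0x20 || code > 0x7E) {
                // Non-printable and 8-bit codes are written as three-digit octal escapes.
                out.append(String.format("\\%03o", code));
            } else {
                out.append((char) code);
            }
        }
        return out.append(')').toString();
    }

    public static void main(String[] args) {
        // U+2019 is byte 146 (octal 222) in windows-1252.
        System.out.println(toPdfLiteral("It\u2019s"));
    }
}
```

This only shows the byte mapping; it does not remove the need for a raw-command hook to get the result into the content stream.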

So, what's going on here? The answer referenced above implies that showText should not have the issues of the old drawString method, but something clearly isn't working.

EDIT: As requested in the comments, here is the full stack trace of the exception:

Exception in thread "main" java.lang.IllegalArgumentException: Can't encode U+2019 in font Helvetica. Type 1 fonts only support 8-bit code points
    at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:343)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:285)
    at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:314)
    at com.fatfractal.test.PDFBoxTest.textWidth(PDFBoxTest.java:148)
    at com.fatfractal.test.PDFBoxTest.showFlowingTextAt(PDFBoxTest.java:128)
    at com.fatfractal.test.PDFBoxTest.build(PDFBoxTest.java:73)
    at com.fatfractal.test.PDFBoxTest.main(PDFBoxTest.java:97)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
shawkinaw
  • Possible duplicate: http://stackoverflow.com/questions/5425251/using-pdfbox-to-write-utf-8-encoded-strings-to-a-pdf – Ya Wang Nov 10 '15 at 18:57
  • This is not a duplicate, the characters I'm talking about are in the base fonts according to the spec. – shawkinaw Nov 10 '15 at 18:59
  • You are referencing Adobe, not PDFBox. What do you expect? Reference something from the right source, like https://pdfbox.apache.org/2.0/... You know that Adobe doesn't make PDFBox? Apache maintains PDFBox, so why are you referencing what Adobe can do versus what PDFBox can do (when the two are completely different things)? – Ya Wang Nov 10 '15 at 19:04
  • Well it's the format the PDF standard ISO-32000-1 is based upon (http://www.iso.org/iso/home/news_index/news_archive/news.htm?refid=Ref1141). I am aware that PDFBox isn't made by Adobe, I don't see why you have to take such a tone. – shawkinaw Nov 10 '15 at 19:18
  • The problem is that you are trying to write the Unicode character for quoteright, which is not the same character code as the one used by the Windows-1252 character set. You essentially want to write the character "\u0092". – jtahlborn Jan 04 '16 at 19:49
  • Can you post the full stack trace of the exception? – jtahlborn Jan 04 '16 at 20:17
  • @jtahlborn Have added stack trace to question text. – shawkinaw Jan 04 '16 at 20:58
  • *From the answer from above it is implied that showText should not have the issues present with the old drawString method, but something clearly isn't working.* - `showText` is improved. `drawString` always assumed a fixed encoding while `showText` asks the font to do the encoding. Unfortunately `PDType1Font` has a broken `encode` method. It works well with composite fonts, though. – mkl Jan 05 '16 at 07:59
  • @maaartinus As you offered the bounty... do the answers satisfy or are there still open issues in your eyes? – mkl Jan 06 '16 at 11:22
  • @mkl I didn't have the time to look at it before. Now I awarded the bounty to the first answer, though yours is more complete. – maaartinus Jan 06 '16 at 15:25

2 Answers

Looking at the PDFBox code, it really seems like a bug. If you look at the PDType1Font.encode() method, it automatically throws if the code point is > 0xFF. However, if the logic instead proceeded in this case, the GlyphList would convert the "\u2019" character to "quoteright", which would then be a valid character in the font.
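
As a rough sketch of the lookup described above (this is not PDFBox's actual code): the two small maps below are hand-written stand-ins for the real glyph list and the font's encoding, and the class name GlyphNameEncoder is hypothetical. The point is only that going from code point to glyph name to encoded byte lets U+2019 through instead of rejecting everything above 0xFF.

```java
import java.util.HashMap;
import java.util.Map;

final class GlyphNameEncoder {
    // Hypothetical excerpt: Unicode code point -> glyph name.
    private static final Map<Integer, String> UNICODE_TO_NAME = new HashMap<>();
    // Hypothetical excerpt: glyph name -> WinAnsi code.
    private static final Map<String, Integer> NAME_TO_CODE = new HashMap<>();

    static {
        UNICODE_TO_NAME.put(0x0041, "A");
        UNICODE_TO_NAME.put(0x2018, "quoteleft");
        UNICODE_TO_NAME.put(0x2019, "quoteright");
        UNICODE_TO_NAME.put(0x20AC, "Euro");

        NAME_TO_CODE.put("A", 0x41);
        NAME_TO_CODE.put("quoteleft", 0x91);
        NAME_TO_CODE.put("quoteright", 0x92);
        NAME_TO_CODE.put("Euro", 0x80);
    }

    /** Maps a code point to a single-byte code via its glyph name. */
    static int encode(int codePoint) {
        String name = UNICODE_TO_NAME.get(codePoint);
        if (name == null) {
            throw new IllegalArgumentException(
                    String.format("No glyph name for U+%04X", codePoint));
        }
        Integer code = NAME_TO_CODE.get(name);
        if (code == null) {
            throw new IllegalArgumentException("Glyph " + name + " is not in the encoding");
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encode(0x2019)); // prints 146
    }
}
```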

jtahlborn

As @jtahlborn explained in his answer, PDType1Font.encode() is broken in the current 2.0.0 release candidate.

In contrast to the 1.x.x PDPageContentStream method drawString, though, the 2.0.0 release candidate method showText is encoding aware.

As a work-around, therefore, you could use a composite font with subset embedding instead, e.g. on a standard MS Windows installation:

InputStream fontStream = new FileInputStream("c:/Windows/Fonts/ARIALUNI.TTF");
PDType0Font font = PDType0Font.load(pdf, fontStream);

Using this font your code will not fail for "’" because composite font classes do not have the bug observed in PDType1Font here.

mkl
  • You can include the fontbox library, which has a FontFileFinder for getting the font files in a less system-dependent manner. – matt Jan 05 '16 at 08:35
  • @matt Definitely a good idea for production code. Alternatively bring your own fonts along, e.g. as resources, if you don't want to depend on fonts found on the deployment computer. – mkl Jan 05 '16 at 11:20
  • I just had a look at the `FontFileFinder`; in case of Windows computers it uses `Runtime.exec()`. Depending on the environment in which PDFBox is to be used, use of `Runtime.exec()` may be restricted by means of a `SecurityManager`. So if you are planning to deploy into a secured environment, you definitely should bring your own fonts. – mkl Jan 05 '16 at 11:29