
My program generates relatively simple PDF documents on request, but I'm having trouble with Unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in parentheses:

(something)

There is also the option to escape a character with octal codes:

(\527)

but three octal digits only take you up to \777 (decimal 511). How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem willing to tell me how to actually do it.


Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (in which I've fixed several bugs, since the original author appears to have gone AWOL). It lets you program against an AWT Graphics interface, and ideally any replacement should do the same.

The alternatives seem to be either HTML-to-PDF converters, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. Either would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in layout.


Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle Unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.


Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

Marcus Downing
  • In addition to wrapping strings with `()`, you can also use `<>`. Within the angle brackets you use hex digits rather than characters. Much less efficient, but you don't need to worry about escapes. `<FEFF00480065006C006C006F00200057006F0072006C00640021>` : "Hello World!" as a UTF-16 string. Plinth's post is also important... you MUST use FE FF. FFFE is Bad. For some reason. :/ – Mark Storer Feb 08 '11 at 18:50
  • @MarkStorer, It has to be `FEFF` because it has to be UTF-16BE. – Marius Oct 09 '19 at 14:55
  • These work for me (give áéíóú correctly): \341\351\355\363\372, but these don't: \527\777 (they display as Wÿ). Is there a way to know which ones are going to work, e.g. with `<>`? – murkle Dec 24 '19 at 10:58
  • @murkle the octal value is stripped to 8 bits – mirabilos Feb 07 '21 at 07:08
  • @mirabilos thanks!!! That seems to be very hard information to find online :) – murkle Feb 08 '21 at 08:03
  • @murkle http://paulbourke.net/dataformats/postscript/ and the linked PDF explain this for PostScript, and PDF is the same. https://czyborra.com/draft/printing.html tells a _very_ sad story though ☹ – mirabilos Feb 08 '21 at 14:51

8 Answers


In chapter 3 of the PDF reference, this is what it says about Unicode:

Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography). For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase).
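
For illustration (my own example, not taken from the spec): the BOM bytes 254 and 255 are \376\377 in octal, so a metadata string such as a /Title value can be written in either of these equivalent forms:

/Title <FEFF00480065006C006C006F>               % "Hello" as a UTF-16BE hex string with BOM
/Title (\376\377\000H\000e\000l\000l\000o)      % the same string in literal form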

plinth
  • I knew this sounded too good to be true. The "text strings" are used for document metadata (annotations, bookmark names), **not** for rendered text! – Brecht Machiels Sep 05 '12 at 21:26
  • @BrechtMachiels At least in the PDF 1.7 reference, the Text object (`BT`) text display operator (`Tj`) explicitly says "Show a text string." Which means that they can be UTF-16BE encoded as described. – jdmichal Mar 05 '15 at 23:00
  • @jdmichal That won't work automatically. The encoding of the strings can only be UTF-16BE if the font supports it (effectively, it has to be a CID font with a ToUnicode value and several other elements). – plinth Mar 06 '15 at 13:41
  • There's one detail I cannot seem to wrap my head around: can UTF-16BE be used together with the `(text string)` syntax? This syntax implies that the single-byte `)` character is the termination marker. What about UTF-16BE code units where the high byte happens to have the value `29h`? Would all of those need to be escaped? Or does UTF-16BE mandate use of hex strings (`<...>`)? – IInspectable Jan 15 '17 at 14:03
  • If the string is UTF-16BE, then the string has to be an even number of bytes. There are no conflicts with high bytes. – plinth Jan 17 '17 at 15:00
  • You made my day with that paragraph (when working on a feature that involves poppler pdf strings). Thanks! – Nelson Sep 29 '18 at 14:28
  • Does anyone have a simple PDF example to show this? I've had the same problem as here: https://github.com/tesseract-ocr/tesseract/issues/1150 ie https://github.com/tesseract-ocr/tesseract/files/1331279/hello-bom.pdf seems to render as HELLO in Adobe PDF Reader – murkle Dec 24 '19 at 09:12
  • I have used PDFSharp to extract text from a Japanese PDF. The string I receive, \0-\0D\0S\0D\0Q\0H\0V\0H\0*\0U\0D\0P\0P\0D\0U\0*\0X\0L\0G\0H\01\0R\0Y\0H\0P\0E\0H\0U\0\u0015\0\u0014\0\u000f\0\u0015\0\u0013\0\u0014\0\u0015, should decode to "J a p a n e s e G r a m m a r G u i d e N o v e m b e r 2 1 , 2 0 1 2", but "\u00##" does not seem to match Unicode. I am fairly certain this encoding is correct, as the numerical relationships between characters match. What is this strange encoding? It is not standard Unicode. – Jody Sowald Aug 05 '20 at 14:25

The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter - and a long one at that - devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of the exercise. The solution you discovered - use a third-party library to do the work for you - is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.

Derek Clegg

Algoman's answer is wrong about several things. You can make a PDF document with Unicode in it, and it's not rocket science, though it needs some work. Yes, he is right that to use more than 255 characters in one font you have to create a composite font (CIDFont) PDF object. Then you just reference the actual TrueType font you want to use as the DescendantFont entry of the CIDFont. The trick is that after that you have to use the glyph indices of the font instead of character codes. To get this index map you have to parse the cmap section of the font - get the contents of the font with the GetFontData function and consult the TTF specification. And that's it! I've just done it, and now I have a Unicode PDF!

Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020

And yes, don't forget the /ToUnicode entry, as @user2373071 pointed out, or the user will not be able to search your PDF or copy text from it.
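
To make the structure concrete, here is roughly what the pair of font objects looks like (a sketch only - the object numbers and the /MyFont name are placeholders of mine, not from the answer):

10 0 obj                  % the composite (Type0) font referenced from the page resources
<<
    /Type /Font
    /Subtype /Type0
    /BaseFont /MyFont
    /Encoding /Identity-H
    /DescendantFonts [11 0 R]
    /ToUnicode 13 0 R
>>
endobj

11 0 obj                  % the CIDFont wrapping the actual TrueType font program
<<
    /Type /Font
    /Subtype /CIDFontType2
    /BaseFont /MyFont
    /CIDSystemInfo << /Registry (Adobe) /Ordering (Identity) /Supplement 0 >>
    /FontDescriptor 12 0 R
    /CIDToGIDMap /Identity
    % a /W widths array is usually needed here as well for correct glyph spacing
>>
endobj

With /Identity-H encoding and /CIDToGIDMap /Identity, the 2-byte codes in your strings are interpreted directly as glyph indices, which is exactly what the cmap parsing described above produces.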

dredkin
  • This is correct. To elaborate (since it was a slightly frustrating job to figure out the details): generate a CIDFont with your font as the `/BaseFont`, the same `/FontDescriptor` and a `/CIDSystemInfo` value of `<< /Registry (Adobe) /Ordering (Identity-H) /Supplement 0 >>`. Generate a Type0 font with your font as the `/BaseFont`, `/Encoding` `/Identity-H` and the CIDFont as its DescendantFont. You can then use that to encode your text in 16-bit big-endian glyph indices (parse the font's `cmap` table to translate Unicode to glyph indices), which you will probably want to emit as hex strings. – Tau Sep 03 '22 at 20:39
  • Thanks, all. I couldn't make this work with inbuilt fonts but it's OK with embedded TrueType. I had to include widths (/W) in the CIDFont (CIDFontType2) to get correct spacing. I've also posted a response around /ToUnicode (originally part of this comment, but it became too tortuous). – Tim V Nov 30 '22 at 20:20

As dredkin pointed out, you have to use glyph indices instead of Unicode character values in the page content stream. This is sufficient to display Unicode text in PDF, but the text will not be searchable. To make the text searchable, or to have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.
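
For reference, the /ToUnicode stream contains a CMap. A minimal sketch of its shape (the glyph IDs here are invented for the example - it maps glyph <0003> to U+0048 and glyphs <0010>..<0012> to U+0061..U+0063):

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
1 beginbfchar
<0003> <0048>
endbfchar
1 beginbfrange
<0010> <0012> <0061>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end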

user2373071

See Appendix D (page 995) of the PDF specification. Only a limited number of fonts and character sets are pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only the required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF, and it is a major hassle.

Check out PDFBox or iText.

http://www.adobe.com/devnet/pdf/pdf_reference.html

jm4

I have now worked on this subject for several days, and what I have learned is that Unicode is (as good as) impossible in PDF. Using 2-byte characters the way plinth described only works with CID-Fonts.

Seemingly, CID-Fonts are a PDF-internal construct and not really fonts in that sense - they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).

So, to use Unicode in PDF directly:

  1. you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?) and extract the character metrics, etc.
  2. you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
  3. also, 2-byte characters don't even cover the full Unicode space

IMHO, these points make it absolutely unfeasible to use Unicode directly.



What I am doing instead is using the characters indirectly, in the following way: for every font, I generate a codepage (and a lookup table for fast lookups) - in C++, something like this:

std::map<std::string, std::vector<wchar_t> > Codepage;      // font name -> characters, in codepage order
std::map<std::string, std::map<wchar_t, int> > LookupTable; // font name -> character -> index into the codepage

Then, whenever I want to put a Unicode string on a page, I iterate over its characters, look them up in the lookup table and, if they are new, add them to the codepage, like this:

for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    // first occurrence of this character in this font? then register it
    if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
    {
        LookupTable[fontname][*i] = Codepage[fontname].size();
        Codepage[fontname].push_back(*i);
    }
}

Then I generate a new string in which the characters of the original string are replaced by their positions in the codepage, like this:

static std::string hex = "0123456789ABCDEF";
std::string result = "<";                   // PDF hex strings are delimited by < and >
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
    int id = LookupTable[fontname][*i] + 1; // +1: codes start at 1, matching the /Differences array below
    result += hex[(id & 0x00F0) >> 4];      // high nibble
    result += hex[(id & 0x000F)];           // low nibble
}
result += ">";

For example, "H€llo World!" might become <010203030405060407030809>, and now you can just put that string into the PDF and have it printed, using the Tj operator as usual...
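
In a page content stream, using that encoded string would look something like this (font size and position are arbitrary):

BT
    /F1 16 Tf                        % select the font whose codepage we built
    72 720 Td                        % position the text
    <010203030405060407030809> Tj    % "H€llo World!" in codepage encoding
ET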

But you now have a problem: the PDF doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the PDF file. This is done by adding an /Encoding dictionary to the font object and setting its /Differences array.

For the "H€llo World!" example, this font object would work:

5 0 obj 
<<
    /F1
    <<
        /Type /Font
        /Subtype /Type1
        /BaseFont /Times-Roman
        /Encoding
        <<
          /Type /Encoding
          /Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
        >>
    >> 
>>
endobj 

I generate it with this code:

ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
    (*stream) << "  /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;

    (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
    for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
        (*stream) << "    /" << GlyphName(*j) << "\n";
    (*stream) << "  ] >>";

    (*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";

Notice that I use a global font register - I use the same font names /F1, /F2, ... throughout the whole PDF document. The same font-register object is referenced in the /Resources entry of every page. If you do this differently (e.g. one font register per page), you might have to adapt the code to your situation...

So how do you find the names of the glyphs (/Euro for "€", /exclam for "!", etc.)? In the above code, this is done by simply calling GlyphName(*j). I generated this method with a bash script from the list found at

http://www.jdawiseman.com/papers/trivia/character-entities.html

and it looks like this:

const std::string GlyphName(wchar_t UnicodeCodepoint)
{
    switch(UnicodeCodepoint)
    {
        case 0x00A0: return "nonbreakingspace";
        case 0x00A1: return "exclamdown";
        case 0x00A2: return "cent";
        ...
    }
}

A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.

Inside the PDF, different codepages are represented by different fonts, so to switch between codepages you would have to switch fonts, which could theoretically blow your PDF up quite a bit - but I, for one, can live with that...

Algoman
  • By the way - the list of glyphs I mentioned contains more than 3600 entries. The generated code file is 175 KiB, and the compiled object file is 600 KiB (1.1 MiB in the debug version). – Algoman Aug 05 '15 at 11:35
  • As soon as you start using other fonts than the standard 14 fonts, CID fonts can become quite natural. – mkl Aug 05 '15 at 19:45
  • *1. you would have to convert normal fonts to CID-Fonts, which is probably extremely hard* - this is fairly straightforward for OpenType (with CFF or TrueType outlines) fonts. These can be included as `CIDFontType0` (CFF) or `CIDFontType2` (TrueType) using `Identity-H` encoding. This is what I do in [rinohtype](https://github.com/brechtm/rinohtype/blob/6e6b024e757eff57a8cef143710e667e0d2f365f/rinoh/backend/pdf/__init__.py#L75). – Brecht Machiels Jan 19 '16 at 15:40
  • *2. you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts* - I don't think this is correct. As far as I can tell the only difference in displaying text is that you need to use 16-bit character codes. – Brecht Machiels Jan 19 '16 at 15:52

dredkin's answer has worked fine for me in the forward direction (Unicode text to PDF representation).

I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 ... dstStringm ] endbfrange construct.
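
As a sketch of that construct (glyph IDs invented for the example): the plain form maps a whole range by offset, while the array form gives each code in the range its own destination string:

2 beginbfrange
<0010> <0012> <0061>                    % glyphs 16-18 -> "a", "b", "c"
<0003> <0005> [<0048> <0065> <006C>]    % glyphs 3, 4, 5 -> "H", "e", "l"
endbfrange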

This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for Cyrillic characters to fail in browsers and SumatraPDF (the copy/paste produces the glyph IDs instead of the characters); by excluding those two glyphs I made it work there. (I really can't see what's special about these glyphs, and it's independent of font - i.e. it's the same glyphs, but different characters, in Times/Georgia/Palatino - and these values are, afaik, identically mapped in UTF-16. Any ideas welcome!)

However, and more importantly, I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space), or 0x2d and 0xad (hyphen and soft hyphen); these two pairs are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semicolon and Greek question mark). And it would be quite reasonable to re-use Cyrillic small a and Latin small a, and similar homoglyphs. So the point is that in the non-ASCII world that prompts us to worry about Unicode at all, we will encounter a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.

The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because it completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful out of SumatraPDF or four browsers).

Tim V
  • Thanks. (1) On /ActualText, I suspected that and broke the text into words; behaviour is still erratic around spaces (similar to the discussion in the article you cite). (2) On 0x100, this is a glyph ID, not a Unicode code point - and I'm not aware of any requirement that glyphs should be so numbered, so I suspect that the browsers may be confusing the two. (3) I should probably have been clearer that I am using 4 bytes for all these hex numbers in the PDF file. Glyph IDs between 0x0100 and 0x01ef, and from 0x0190 onwards, are fine, so I don't think it's an encoding problem. Still playing... – Tim V Dec 01 '22 at 16:55
  • Thanks again. I tried using FEFF at the beginning of the four-digit hex strings in /ToUnicode but it broke what was already working (though I may have misunderstood you). I'm parking this for now as it works in Adobe, and my use case can live with that requirement as a workaround. I may revisit it at some point in the future. – Tim V Dec 02 '22 at 18:36

I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped into my mind:

Are you sure you are using a font that supports all the characters you need?

In our application, we create PDFs from HTML pages (with a third-party library), and we had this problem with Cyrillic characters...

Filini
  • We're sticking to the basic fonts that are on every computer, and not embedding any fonts. – Marcus Downing Sep 24 '08 at 17:04
  • "PDF specs at Adobe should tell you everything". It should, unfortunately, in my experience, they don't. – Renan Sep 06 '11 at 01:39
  • @Renan: "PDF specs at Adobe should tell you everything". It should, unfortunately, in my experience, you don't find them easily and they are often unnecessarily complicated. – Algoman Jul 07 '15 at 10:31
  • @Renan Indeed! Some of the 1.7 spec is even wrong! Bah, I say! Bah! – Benjamin Nolan Jan 13 '21 at 15:49