Read Japanese characters in a PDF file

Question

I have the following command:

[<0e0f0a52030d030e0ce5030f0744030f>10<030d>10<0cd4>]TJ

I know that the hex strings hold Japanese text, because that is the only thing in the PDF, and this line is in the only content stream of the file's single page.

The problem is that no matter how I try to decode these hex strings, I end up with gibberish. I've decoded the hex strings to bytes and tried applying literally every charset I could find, and I still get gibberish.

(Perhaps I was desperate, because I knew it probably would not work either.) I've also tried it the other way around: testing on Android, I'm able to import the PDF's Japanese text (loading it from a resource), and while debugging I can see the REAL Japanese text in the value of the String instance. Again I applied every charset to it, hoping to produce a run of 4-6 hex characters matching something in the file, but again... nothing.

I actually don't need the glyphs; I would settle for the correct text...

Could it be that the text itself is encoded by something other than a charset encoding? Can anyone point me in the right direction?

=== UPDATE ===

OK, So I figured out that there is an extra "encryption", Identity-H, and I've read here that you need a /ToUnicode map which I cannot seem to find in the file.

What drives me nuts is that other PDF viewers can show the PDF, and I cannot figure out how!

Again, any bone would be nice... hell I'll go for scraps :)

Thanks,

Adam.

For some file context:

...
10 0 obj
    << 
    /Type /Page 
    /Parent 7 0 R 
    /Resources 11 0 R 
    /Contents 16 0 R 
    /MediaBox [ 0 0 595 842 ] 
    /CropBox [ 0 0 595 842 ] 
    /Rotate 0 
    >> 
endobj
11 0 obj
    << 
    /ProcSet [ /PDF /Text ] 
    /Font << /TT2 13 0 R /G1 12 0 R >> 
    /ExtGState << /GS1 19 0 R >> 
    /ColorSpace << /Cs6 15 0 R >> 
    >> 
endobj
12 0 obj
    << 
    /Type /Font 
    /Subtype /Type0 
    /BaseFont /Ryumin-Light-Identity-H 
    /Encoding /Identity-H 
    /DescendantFonts [ 18 0 R ] 
    >> 
endobj
13 0 obj
    << 
    /Type /Font 
    /Subtype /TrueType 
    /FirstChar 32 
    /LastChar 32 
    /Widths [ 278 ] 
    /Encoding /WinAnsiEncoding 
    /BaseFont /Century 
    /FontDescriptor 14 0 R 
    >> 
endobj
14 0 obj
    << 
    /Type /FontDescriptor 
    /Ascent 985 
    /CapHeight 0 
    /Descent -216 
    /Flags 34 
    /FontBBox [ -165 -307 1246 1201 ] 
    /FontName /Century 
    /ItalicAngle 0 
    /StemV 0 
    >> 
endobj
15 0 obj
    [ 
    /ICCBased 20 0 R 
    ]
endobj
16 0 obj
    << /Length 2221 /Filter /FlateDecode >> 
        stream
        ...
                [<0e0f0a52030d030e0ce5030f0744030f>10<030d>10<0cd4>]TJ
        ...
                <00e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e700e7>Tj
        ...
                <030e030d0a48064403740353035a039408030ebd074807c1036e0358039304e10c8802a2074807c10cd40e8a030e030d02a303770a2a0a100374036d034d036f00e7>Tj
        ...
    endstream
endobj
17 0 obj
    << 
    /Type /FontDescriptor 
    /Ascent 723 
    /CapHeight 709 
    /Descent -241 
    /Flags 6 
    /FontBBox [ -170 -331 1024 903 ] 
    /FontName /Ryumin-Light 
    /ItalicAngle 0 
    /StemV 69 
    /XHeight 450 
    /Style << /Panose <010502020300000000000000>>> 
    >> 
endobj
18 0 obj
    << 
    /Type /Font 
    /Subtype /CIDFontType0 
    /BaseFont /Ryumin-Light 
    /FontDescriptor 17 0 R 
    /CIDSystemInfo << /Registry (Adobe)/Ordering (Japan1)/Supplement 2 >> 
    /DW 1000 
    /W [ 231 [ 500 ] ] 
    >> 
endobj
19 0 obj
    << 
    /Type /ExtGState 
    /SA false 
    /SM 0.02 
    /TR2 /Default 
    >> 
endobj
20 0 obj
    << /N 3 /Alternate /DeviceRGB /Length 2572 /Filter /FlateDecode >> 
    stream
    ...
    endstream
endobj
...

See http://stackoverflow.com/questions/128162/unicode-in-pdf — Remy Lebeau, Mar 16 '14 at 05:55
I've figured out that there is an Identity-H "Encoding/Encryption" on the texts, but there is no shred of evidence of a /ToUnicode map in the file... any pointers? — TacB0sS, Mar 16 '14 at 14:57
The hex octets you have shown do not appear to be in a standard UTF or ANSI charset encoding, and they contain octets outside of the ASCII visual range (0x00, 0x03, 0x0E, etc), so there is likely another layer of encoding involved that you are not accounting for yet. — Remy Lebeau, Mar 16 '14 at 17:55
*you need a /ToUnicode map which I cannot seem to find in the file. What drives me nuts is that other PDF viewers can show the PDF, and I cannot figure out how!* - You need **ToUnicode** to canonically retrieve the Unicode characters represented by the PDF, not to *show the PDF*. To show it, you merely need to understand fonts. In your case, given `/Registry (Adobe)/Ordering (Japan1)/Supplement 2`, have a look at Adobe Technical Note #5078-b [Adobe-Japan1-2 Character Collection for CID-Keyed Fonts](http://ftp.ktug.org/obsolete/info/adobe/devtechnotes/pdffiles/5078b.pdf). — mkl, Mar 16 '14 at 21:22
mkl, "Show the PDF" was a poor choice of words (frustration...). I've compared the hex with the fonts and you are correct, these are the fonts I need!! So what you are saying is that I need to map the hex values to the characters in the file... If I got this right, this should hold the key to my troubles: http://sourceforge.net/projects/cmap.adobe/files/cmapresources_japan1-6.tar.z/download The question is... how do I make sense of all this data? Off to a new adventure :) — TacB0sS, Mar 17 '14 at 00:27
mkl, I looked at the table file... and there are 30+ columns... How do I determine which column is the one I need? For a given CID there are different values in each column, so if I get the column wrong... what would I be showing the user? Also, the naming convention does not look like anything I can derive from the PDF. — TacB0sS, Mar 17 '14 at 20:45
@TacB0sS Unfortunately SO did not inform me about your responses. Use the @ plus username to make sure the person in question is informed. That being said, I'm not really a cmap expert and, therefore, can hardly help with the tables... — mkl, Mar 18 '14 at 20:46

score 2 · Answer 1 · answered Mar 16 '14 at 19:49

Here is your problem:

I figured out that there is an extra "encryption", Identity-H, and I've read here that you need a /ToUnicode map which I cannot seem to find in the file.

That indicates the two-byte hex codes in your text strings are immediate glyph indexes into the original font file. Search the font file for a Unicode character map (one of its cmap entries); this will provide the link from glyph index to Unicode.

Note that it's possible that a glyph index does not translate immediately to a Unicode codepoint. A GSUB or GPOS OpenType table may have taken one or more Unicode characters as input and substituted them with another glyph in the output string. It's also possible (but less likely) the original creator inserted raw glyphs manually.
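As a minimal sketch of the first step (getting at the two-byte codes), the hex strings of a Tj/TJ operand can be split like this. The function name `tj_codes` is made up for illustration, and the sketch assumes the stream has already been inflated and that the font uses /Encoding /Identity-H, where each big-endian two-byte code is used directly as the CID. It does not translate anything to Unicode yet.

```python
import re


def tj_codes(operand):
    """Return the 2-byte codes found in the hex strings of a Tj/TJ operand.

    Assumes an Identity-H font: every big-endian 2-byte code is the CID itself.
    Numbers between the strings (TJ kerning adjustments) are ignored.
    """
    codes = []
    for hex_str in re.findall(r"<([0-9A-Fa-f\s]*)>", operand):
        digits = re.sub(r"\s+", "", hex_str)
        if len(digits) % 2:
            # PDF hex strings with an odd number of digits get a trailing 0
            digits += "0"
        data = bytes.fromhex(digits)
        for i in range(0, len(data), 2):
            codes.append(int.from_bytes(data[i:i + 2], "big"))
    return codes


line = "[<0e0f0a52030d030e0ce5030f0744030f>10<030d>10<0cd4>]TJ"
print([f"{code:04x}" for code in tj_codes(line)])
```

The resulting numbers are the values that have to be looked up in whatever table applies to the font: the font file's own tables if the font is embedded, or the character collection named in /CIDSystemInfo if it is not.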

score 2 · Accepted Answer · edited Sep 30 '15 at 18:02

Although most thoughts here are fundamentally correct, they are neither complete nor exact, so:

The /ToUnicode map MAY be present in the PDF file, but it is not mandatory!!!
There are external, predefined CMaps for multiple languages, here.

It was pretty frustrating to dig for so long in the wrong place. I tore the PDF into tiny pieces and went through all the streams in the file to find this map, without luck, because it WAS NOT IN THE FILE!

I hope this saves someone else the hassle...
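For anyone wondering what using such an external map could look like, here is a rough sketch. It assumes the CID-to-Unicode mapping is available as text in the `beginbfchar`/`beginbfrange` syntax that /ToUnicode CMaps use; the real resource files may need a fuller parser. `SAMPLE_CMAP` is toy data made up for the example and is NOT copied from the Adobe-Japan1 resources, and the names `parse_cmap` and `decode` are hypothetical.

```python
import re

# Toy fragment in ToUnicode-CMap syntax. The values are placeholders for
# illustration only, not the real Adobe-Japan1 mappings.
SAMPLE_CMAP = """
1 beginbfchar
<0010> <3042>
endbfchar
1 beginbfrange
<0001> <0003> <0041>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_cmap(text):
    """Build a {code: unicode_string} table from bfchar/bfrange sections.

    Simplifications: the array form of bfrange ([<..> <..>]) is not handled,
    and bfrange destinations are assumed to be single BMP code points.
    """
    table = {}
    for body in re.findall(r"beginbfchar(.*?)endbfchar", text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, body):
            # bfchar destinations are UTF-16BE encoded strings
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for body in re.findall(r"beginbfrange(.*?)endbfrange", text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, body):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                table[code] = chr(start + offset)
    return table


def decode(codes, table):
    """Translate CIDs to text; unknown CIDs become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)


table = parse_cmap(SAMPLE_CMAP)
print(decode([0x0001, 0x0002, 0x0003, 0x0010], table))
```

With the real map for the collection named in /CIDSystemInfo (Adobe, Japan1 in my file) loaded instead of the toy fragment, the same lookup would be applied to the two-byte codes taken from the content stream.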
