If a PDF font has both Encoding and ToUnicode, how should character codes be mapped for text extraction?

Question

I used qpdf to uncompress a PDF file; the output is shown below. Both Encoding and ToUnicode are present. If only ToUnicode were present, I would know how to map individual characters using the CMap. But the content stream looks like this:

Tf
0.999402 0 0 1 71.9995 759.561 Tm
[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679(   )-3.75591()9.73429()]TJ

Inside the brackets there is some garbage data that is not visible. So how do I link this data to the CMap?

Another question: what do the values in the Differences array of /Encoding mean?

10 0 obj
<< /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

When I pass the names from the Differences array one by one to the FreeType function FT_Get_Name_Index, it returns values like [ 100 28 94 3 87 24 38 47 62 ].

What are those values, and how do I map them?

Here is the PDF.

Run the following command:

qpdf --stream-data=uncompress input.pdf output.text

output.text

I get the same output if I pass the content stream data through zlib. Kindly check the output.txt file from the link.

*Inside the brackets there is some garbage data that is not visible* - completely wrong. In the brackets there are the values identifying your glyphs, so they definitely are not garbage. — mkl, Oct 14 '16 at 11:14
Yes, there are some values, I know. But in this case I didn't get the values. Only with those values can we map via the CMap file. I don't know why I am not getting them. Please tell me if you know. — pratik solanki, Oct 14 '16 at 12:27
I uploaded the PDF file; please check it out and you will see what I am trying to tell you. Run this command: qpdf --stream-data=uncompress input.pdf output.text. It gives you the output.text file, which is the same file I showed you above. — pratik solanki, Oct 14 '16 at 13:04
Ah, that is your misunderstanding: `qpdf --stream-data=uncompress` does not give you a text file, you can merely recognise more of it in a text editor but it still is a binary file. — mkl, Oct 14 '16 at 13:23
Yes, it does give you one; that is what I based my conclusion on. One more thing: even when I pass the content data (see /Contents 5 0 R) through zlib, I get the same thing I got from the qpdf tool. — pratik solanki, Oct 14 '16 at 13:31

score 4 · Accepted Answer · answered Oct 17 '16 at 12:42

Firstly the general question

How to extract the text in a PDF if Encoding and ToUnicode are both present? How to map it?

[...] if you see, both Encoding and ToUnicode are present in the PDF. I know how to map individual characters with the CMap file if only ToUnicode is there.

In such a case, i.e. when you have both a sufficiently complete and correct ToUnicode map and an Encoding for a font, you can ignore the Encoding and only use the ToUnicode map.

This follows from the PDF specification, which in section 9.10.2 "Mapping Character Codes to Unicode Values" states that the method with the highest priority for mapping a character code to a Unicode value is

If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

Thus, if you (as you say) already know how to extract text when there is only a ToUnicode map, you can use the same algorithm unchanged. As a corollary, if that doesn't work, either the ToUnicode map in question is insufficiently complete or incorrect, or your knowledge of how to extract text using only a ToUnicode map is incomplete.
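The priority rule can be sketched in a few lines of Python. This is only an illustration, not a complete extractor: the function name `code_to_unicode`, the plain dictionaries standing in for a parsed ToUnicode CMap and a parsed Encoding, and the tiny `glyph_list` are all assumptions made for the example. Falling through to the Encoding when the ToUnicode map lacks an entry is a leniency choice of this sketch, not something the quoted rule requires.

```python
from typing import Dict, Optional


def code_to_unicode(
    code: int,
    to_unicode: Optional[Dict[int, str]],
    encoding_names: Optional[Dict[int, str]],
    glyph_list: Dict[str, str],
) -> Optional[str]:
    """Resolve one character code, consulting ToUnicode before Encoding."""
    # Highest priority: the ToUnicode CMap, if the font has one.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # Fallback (assumption of this sketch): code -> glyph name -> Unicode,
    # which only works for glyph names found in a standard glyph list.
    if encoding_names is not None:
        name = encoding_names.get(code)
        if name is not None:
            return glyph_list.get(name)
    return None


# Hypothetical parsed data, loosely modelled on the sample font.
to_unicode = {0x01: "T", 0x02: "E"}
encoding_names = {0x01: "g100", 0x02: "g28", 0x41: "A"}
glyph_list = {"A": "A"}  # tiny stand-in for a standard glyph-name list

print(code_to_unicode(0x01, to_unicode, encoding_names, glyph_list))  # T
print(code_to_unicode(0x41, to_unicode, encoding_names, glyph_list))  # A
```

Without the ToUnicode entry, code 0x01 would resolve to the name g100, which the glyph list does not know, so the function would return None.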

Secondly the sample document

You wrote

[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ

Inside the brackets there is some garbage data that is not visible. So how do I link this data to the CMap?

In the brackets there are the values identifying your glyphs, so they definitely are not garbage.

Thus, here are the byte values from within the brackets:

[(
    01
)-2.11826(
    02
)-1.14177(
    03
)2.67786(
    01
)-2.11826(
    04
)8.55269(
    05
)-5.44998(
    06
)-4.70186(
    07
)2.67786(
    04
)-2.32338(
    07
)2.67786(
    08
)12.679(
    09
)-3.75591(
    02
)9.73429(
    04
)]TJ
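Because the bytes 01 to 09 are control characters, a text editor shows the parentheses as empty. One way to make them visible is to pull the literal strings out of the TJ operand and print them as hex. The following is a minimal sketch under stated assumptions: `tj_string_bytes` is a hypothetical helper, it only understands literal strings in parentheses (not hex strings in angle brackets), and of the backslash escapes it handles only `\(`, `\)` and `\\` correctly.

```python
from typing import List


def tj_string_bytes(tj_array: bytes) -> List[bytes]:
    """Collect the literal strings of a TJ array operand, skipping the kerning numbers."""
    strings: List[bytes] = []
    current = bytearray()
    depth = 0  # nesting level of parentheses; 0 means outside any string
    i = 0
    while i < len(tj_array):
        b = tj_array[i]
        if depth == 0:
            if b == ord("("):
                depth = 1
                current = bytearray()
        elif b == ord("\\") and i + 1 < len(tj_array):
            # Simplification: take the escaped byte literally.
            i += 1
            current.append(tj_array[i])
        elif b == ord("("):
            depth += 1
            current.append(b)
        elif b == ord(")"):
            depth -= 1
            if depth == 0:
                strings.append(bytes(current))
            else:
                current.append(b)
        else:
            current.append(b)
        i += 1
    return strings


# First four elements of the sample array, with the invisible bytes written out.
sample = b"[(\x01)-2.11826(\x02)-1.14177(\x03)2.67786(\x01)]TJ"
print([s.hex() for s in tj_string_bytes(sample)])  # ['01', '02', '03', '01']
```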

Using the ToUnicode map of the font in question

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
1 begincodespacerange
<00><ff>
endcodespacerange
9 beginbfrange
<01><01><0054>
<02><02><0045>
<03><03><0053>
<04><04><0020>
<05><05><0050>
<06><06><0044>
<07><07><0046>
<08><08><0049>
<09><09><004c>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

the byte values from within the brackets map to:

    01    0054    "T"
    02    0045    "E"
    03    0053    "S"
    01    0054    "T"
    04    0020    " "
    05    0050    "P"
    06    0044    "D"
    07    0046    "F"
    04    0020    " "
    07    0046    "F"
    08    0049    "I"
    09    004c    "L"
    02    0045    "E"
    04    0020    " "

Thus,

"TEST PDF FILE "

which matches the rendered file just fine.
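The same lookup can be reproduced programmatically. The sketch below is an assumption-laden illustration rather than a full CMap parser: `parse_bfranges` and `decode` are hypothetical names, only the three-hex-token form of `bfrange` entries is handled (no `bfchar`, no array form), one-byte codes are assumed as in the sample codespace range, and each destination is assumed to be a single code point in the Basic Multilingual Plane.

```python
import re
from typing import Dict

HEX_TRIPLE = re.compile(r"<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>")


def parse_bfranges(cmap_text: str) -> Dict[int, str]:
    """Build a code -> character table from the bfrange sections of a ToUnicode CMap."""
    mapping: Dict[int, str] = {}
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in HEX_TRIPLE.findall(block):
            start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
            # Codes start..end map to consecutive code points from base.
            for offset in range(end - start + 1):
                mapping[start + offset] = chr(base + offset)
    return mapping


def decode(codes: bytes, mapping: Dict[int, str]) -> str:
    """Map one-byte codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


cmap_text = """
9 beginbfrange
<01><01><0054>
<02><02><0045>
<03><03><0053>
<04><04><0020>
<05><05><0050>
<06><06><0044>
<07><07><0046>
<08><08><0049>
<09><09><004c>
endbfrange
"""

codes = bytes([1, 2, 3, 1, 4, 5, 6, 7, 4, 7, 8, 9, 2, 4])
print(repr(decode(codes, parse_bfranges(cmap_text))))  # 'TEST PDF FILE '
```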

Thirdly the encoding

Another question: what do the values in the Differences array of /Encoding mean?

10 0 obj << /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>

According to the PDF specification,

The value of the Differences entry shall be an array of character codes and character names organized as follows:

code_1 name_1,1 name_1,2 …

code_2 name_2,1 name_2,2 …

…

code_n name_n,1 name_n,2 …

Each code shall be the first index in a sequence of character codes to be changed. The first character name after the code becomes the name corresponding to that code. Subsequent names replace consecutive code indices until the next code appears in the array or the array ends. These sequences may be specified in any order but shall not overlap.

Thus, the encoding entry in your case says that the encoding basically is WinAnsiEncoding with the difference that the codes 1, ..., 9 instead represent the glyphs named /g100, /g28, /g94, /g3, /g87, /g24, /g38, /g47, and /g62 respectively.
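The expansion of a Differences array into individual code-to-name assignments can be sketched as follows. The representation is an assumption for illustration: `expand_differences` is a hypothetical helper, codes are given as Python integers and glyph names as strings without the leading slash.

```python
from typing import Dict, List, Union


def expand_differences(differences: List[Union[int, str]]) -> Dict[int, str]:
    """Turn [code, name, name, ..., code, name, ...] into a code -> name table."""
    mapping: Dict[int, str] = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            # A number restarts the sequence at that character code.
            code = item
        else:
            if code is None:
                raise ValueError("Differences array must start with a code")
            # Each name is assigned to the current code, then the code advances.
            mapping[code] = item
            code += 1
    return mapping


differences = [1, "g100", "g28", "g94", "g3", "g87", "g24", "g38", "g47", "g62"]
for code, name in expand_differences(differences).items():
    print(code, name)
```

For the sample array this assigns g100 to code 1 and g62 to code 9; every other code keeps its WinAnsiEncoding meaning.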

As these glyph names are not standard glyph names, the PDF specification does not consider this encoding helpful for text extraction, because it only describes a method for a simple font

that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D)

The "/gXX" names in your sample clearly are not among them.

Hi, I too have used qpdf to uncompress one of my PDF files, but I cannot understand what my content is. I have a ToUnicode in the uncompressed file - https://pastebin.com/mfznMhm6 - and I used the qpdf -qdf option. Can I get the content of the file? — mrtechmaker, Feb 28 '19 at 06:50
*"i cannot understand what my content is"* - You should study the PDF specification to understand what you see. If you have specific questions (*"i cannot understand what my content is"* is not specific at all), create detailed stack overflow questions for them. — mkl, Feb 28 '19 at 11:08

score 0 · Answer 2 · answered Jun 21 '21 at 08:59

It's worth observing that most of the time the /Encoding map is a map from character codes (meaning the encoded bytes of a string) to CIDs, where a CID (character ID) in most font types corresponds to a glyph index/identifier. The exception appears to be Type2 fonts, which have separate CID and GID (glyph ID) concepts and supply a /CIDToGIDMap to convert between them. In these cases the /Encoding map has nothing to do with decoding a Unicode representation of the string.

To decode the Unicode representation you should definitely use /ToUnicode when it is available, as pointed out by @mkl. If it is not available, either you have a predefined encoding (optionally with a /Differences array) or a CMap, or the font program supplies an implicit encoding, as in Type1 fonts. This is all stated in the very good answer by @mkl as well.

/Encoding could possibly correspond to the map between character codes and Unicode code points, either when it is a predefined encoding (like MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding; I have also seen possibly non-compliant use of Identity-H, which is a predefined CMap name, not a predefined encoding) or in a supposedly malformed font.

In this regard the PDF reference/standard is often confusing about what is legal and what is not, so a library decoding encoded strings in PDF should always be as lenient as possible. The PDF reference/standard itself is also not very clear in explaining the distinction between character codes, CIDs, GIDs and Unicode representations.
