I am currently working on iOS PDF scanning using PDFKitten. I am trying to extract text for searching in a PDF that uses a Type0 font, but I cannot extract the text: some entries in ToUnicode are missing and some are misinterpreted. Could this be an issue with the parsing of the CMap? If I don't have a complete CMap, how should I derive it? Can I use external entries for the missing ToUnicode entries?

Thanks

Swaroop

2 Answers

The PDF specification offers hints on how to extract text content in section 9.10.2 Mapping Character Codes to Unicode Values:

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

Furthermore, as section 9.10.1 indicates,

  • An ActualText entry for a structure element or marked-content sequence (see 14.9.4, "Replacement Text") may be used to specify the text content directly

According to the specification, if these methods fail to produce a Unicode value, there is no way to determine what the character code represents. This is not entirely true; e.g. embedded font programs may contain their own mappings to Unicode; but such additional sources of information are beyond the actual PDF format.
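The lookup order quoted above can be read as a chain of fallbacks. A minimal Python sketch, assuming each mapping source is modeled as a function that returns a string or `None`; the names `resolve_unicode`, `via_to_unicode`, `via_glyph_name` and the toy tables are made up for illustration and do not come from any library or from the file discussed here:

```python
from typing import Callable, Optional, Sequence

# Hypothetical stand-in: a resolver takes a character code and returns
# a Unicode string, or None when that source has no answer.
Resolver = Callable[[int], Optional[str]]

def resolve_unicode(code: int, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Try each mapping source in order; the first hit wins."""
    for resolver in resolvers:
        text = resolver(code)
        if text is not None:
            return text
    return None  # no source could say what the code represents

# Toy data, invented for this example only.
to_unicode = {0x25: "C"}                      # ToUnicode CMap entries
differences = {0x25: "percent", 0x41: "A"}    # code -> glyph name
glyph_list = {"percent": "%", "A": "A"}       # tiny glyph-list excerpt

def via_to_unicode(code: int) -> Optional[str]:
    return to_unicode.get(code)

def via_glyph_name(code: int) -> Optional[str]:
    name = differences.get(code)
    return glyph_list.get(name) if name is not None else None

# ToUnicode is consulted first, the glyph-name route second.
resolvers = [via_to_unicode, via_glyph_name]
```

With this ordering, a code present in both tables takes its value from the ToUnicode map, and a code found in neither yields `None`.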

EDIT

The OP provided the file in question, iPhoneConfigurationProfileRef-2013-GM.pdf, via mail and indicated

I am getting problem for every glyph.

The issue is that ranges present in PDF are not complete and are different from adobe-identity-cmap file.

If I only use CMap embedded in PDF, I get no mapping for every character and if I use adobe one the all mappings are wrong.

As he didn't get a mapping for any glyph, let us look at the title page as an example.

The content stream contains these operations relevant for text extraction:

BT 
50 0 0 50 60 669.225 Tm 
/G1 1 Tf 
<0025> Tj 
ET 
BT 
50 0 0 50 87.6 669.225 Tm 
/G1 1 Tf 
<005100500048004b004900570054> Tj 
ET 
BT 
50 0 0 50 238 669.225 Tm 
/G1 1 Tf 
<0043> Tj 
ET 
BT 
50 0 0 50 261.45 669.225 Tm 
/G1 1 Tf 
<0056004b00510050> Tj 
ET 
BT 
50 0 0 50 355.4 669.225 Tm 
/G1 1 Tf
<0032> Tj 
ET 
BT 
50 0 0 50 380.75 669.225 Tm 
/G1 1 Tf 
<0054> Tj 
ET 
BT 
50 0 0 50 396.55 669.225 Tm 
/G1 1 Tf 
<00510048004b004e0047> Tj 
ET 
BT 
50 0 0 50 60 609.225 Tm 
/G1 1 Tf 
<0034> Tj 
ET 
BT 
50 0 0 50 86.65 609.225 Tm 
/G1 1 Tf 
<00470048> Tj 
ET 
BT
50 0 0 50 125.05 609.225 Tm 
/G1 1 Tf 
<00470054> Tj 
ET 
BT 
50 0 0 50 165.45 609.225 Tm 
/G1 1 Tf 
<004700500045> Tj 
ET 
BT 
50 0 0 50 238.9 609.225 Tm 
/G1 1 Tf 
<0047> Tj 
ET

So we need to look only at the font G1 on page 1. Fortunately the font has a ToUnicode map:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
  /Registry (Adobe)
  /Ordering (UCS)
  /Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000><FFFF>
endcodespacerange
1 beginbfchar
<000f><002d 2010>
endbfchar
15 beginbfrange
<0002><0002><0020>
<0004><000c><0022>
<000e><000e><002c>
<0010><001d><002e>
<001f><001f><003d>
<0022><0032><0040>
<0034><003d><0052>
<003f><003f><005d>
<0041><0041><005f>
<0043><005c><0061>
<005e><005e><007c>
<008a><008a><00a9>
<00a4><00a4><2014>
<00a5><00a6><201c>
<00a8><00a8><2019>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end 

Applying this map, one gets (based on the explicit beginbfrange...endbfrange entries):

<0025> Tj                          % "C"       = <0043>                         due to <0022><0032><0040>
<005100500048004b004900570054> Tj  % "onfigur" = <006f006e00660069006700750072> due to <0043><005c><0061>
<0043> Tj                          % "a"       = <0061>                         due to <0043><005c><0061>
<0056004b00510050> Tj              % "tion"    = <00740069006f006e>             due to <0043><005c><0061>
<0032> Tj                          % "P"       = <0050>                         due to <0022><0032><0040>
<0054> Tj                          % "r"       = <0072>                         due to <0043><005c><0061>
<00510048004b004e0047> Tj          % "ofile"   = <006f00660069006c0065>         due to <0043><005c><0061>
<0034> Tj                          % "R"       = <0052>                         due to <0034><003d><0052>
<00470048> Tj                      % "ef"      = <00650066>                     due to <0043><005c><0061>
<00470054> Tj                      % "er"      = <00650072>                     due to <0043><005c><0061>
<004700500045> Tj                  % "enc"     = <0065006e0063>                 due to <0043><005c><0061>
<0047> Tj                          % "e"       = <0065>                         due to <0043><005c><0061>

This matches the appearance of the page very well:

screenshot of the title page
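The manual mapping above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: it handles the `<lo><hi><dst>` form of bfrange entries (not the array form), assumes each range destination is a single UTF-16 code unit, and assumes fixed two-byte codes as declared by the codespace range. The function names `parse_to_unicode` and `decode` are illustrative, and `CMAP` is an excerpt of the map shown earlier.

```python
import re

HEX = r"[0-9a-fA-F]+"

def parse_to_unicode(cmap_text):
    """Return (chars, ranges) read from the bfchar and bfrange sections."""
    chars = {}    # source code -> Unicode string
    ranges = []   # (first code, last code, Unicode value of first code)
    for body in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<(%s)>\s*<([0-9a-fA-F\s]+)>" % HEX, body):
            # The destination is UTF-16BE; whitespace inside <...> is ignored.
            raw = bytes.fromhex("".join(dst.split()))
            chars[int(src, 16)] = raw.decode("utf-16-be")
    for body in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        # Only the <lo><hi><dst> form; the array form is not handled here.
        triple = r"<(%s)>\s*<(%s)>\s*<(%s)>" % (HEX, HEX, HEX)
        for lo, hi, dst in re.findall(triple, body):
            ranges.append((int(lo, 16), int(hi, 16), int(dst, 16)))
    return chars, ranges

def decode(hex_string, chars, ranges, code_len=2):
    """Map a hex string operand of Tj to text, code by code."""
    data = bytes.fromhex(hex_string)
    out = []
    for i in range(0, len(data), code_len):
        code = int.from_bytes(data[i:i + code_len], "big")
        if code in chars:
            out.append(chars[code])
            continue
        for lo, hi, dst in ranges:
            if lo <= code <= hi:
                out.append(chr(dst + code - lo))  # offset within the range
                break
        else:
            out.append("\ufffd")  # assumption: mark unmapped codes
    return "".join(out)

# Excerpt of the ToUnicode map quoted above.
CMAP = """
1 beginbfchar
<000f><002d 2010>
endbfchar
3 beginbfrange
<0022><0032><0040>
<0034><003d><0052>
<0043><005c><0061>
endbfrange
"""

chars, ranges = parse_to_unicode(CMAP)
title = [decode(s, chars, ranges) for s in
         ("0025", "005100500048004b004900570054", "0043", "0056004b00510050")]
```

Joining `title` gives the first word of the title page, matching the table above. A parser that yields no mapping at all for this file is most likely misreading the bfrange section rather than facing an incomplete map.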

mkl
  • Thanks for the reply. I am facing this issue with Adobe-Identity-UCS (Type 0 with CIDType2 as descendant) and I have the Adobe CMap for it. I have parsed that CMap, and still some entries are missing and some mappings are incorrect. How can I fix this? – Swaroop Nov 13 '14 at 14:43
  • What about showing the document you have problems with? – Jan Slabon Nov 13 '14 at 15:22
  • @Swaroop text extraction from your PDF seems to go beyond what the specification considers a normal extraction task. Thus, please do share the PDF in question and indicate with which glyphs you have trouble. – mkl Nov 13 '14 at 22:12
  • @mkl - Hi, I am currently trying to parse iPhoneConfigurationProfileRef-2013-GM.pdf. However, some ranges are missing from the Adobe-Identity-UCS CMap from Adobe, and only a few ranges are present in the font dictionary. When I use the Adobe CMap, every character is mismatched and the extracted text is a mess. What procedure should I follow? Can you share your email address with me so that I can mail the PDF to you? – Swaroop Nov 14 '14 at 07:07
  • @Setasign - the document is displayed perfectly, as the font programs are loaded correctly for rendering the glyphs. The problem occurs when I try to search the text. Unicode mappings are not completely available in ToUnicode. – Swaroop Nov 14 '14 at 07:10
  • @Swaroop You find an email address in my profile. If the file is freely accessible on the net, though, you can also share the link here. – mkl Nov 14 '14 at 08:29
  • @mkl can you please help to fix this CMap Issue for Type 0 font? – Swaroop Nov 21 '14 at 06:02
  • @Swaroop I added the manual text extraction of the title page text of your document. It actually was pretty straight-forward. – mkl Nov 21 '14 at 11:08
  • @Swaroop *how u did it? can you share the code?* - manually. I extracted the streams in question using a PDF object browser (RUPS, if it matters), deleted the uninteresting clipping path, vector graphics, etc operations from the content, and used the **ToUnicode** map entries to map the glyphs to Unicode (I actually quoted the respective range entry used for mapping each string after "due to"; interestingly I needed only one range entry per string which is not usual). Open source libraries return the same result, e.g. iText, PDFBox, or PDFClown. So if you want some source, browse their repos. – mkl Nov 21 '14 at 11:40
  • *Open source libraries return the same result, e.g. iText, PDFBox, or PDFClown. So if you want some source, browse their repos.* - Those libraries are Java libraries and, therefore, probably not usable for you as is. The code is fairly well readable, though, and so can serve as a source of inspiration. – mkl Nov 21 '14 at 11:48
  • I am using PDFKitten for iOS. I found it has many bugs in parsing CMaps. Maybe I should fix that first. I will go through the code. Thanks. – Swaroop Nov 21 '14 at 12:02
  • @Swaroop Yes, it has some bugs, also see [here](http://stackoverflow.com/a/12932653/1729265) or [here](http://stackoverflow.com/a/14167198/1729265). – mkl Nov 21 '14 at 12:44
  • @mkl I was able to extract the text; it was a parser and mapping issue in PDFKitten, which I have now resolved. Thanks a lot for the help. – Swaroop Nov 21 '14 at 14:23
  • Great! Please consider forwarding your fixes to the author of PDFKitten. – mkl Nov 21 '14 at 16:13
To whom it may concern: if some entries are missing from the /ToUnicode CMap and the font does not refer to one of the predefined encodings/CMaps, then Adobe Reader/Acrobat behaves as follows, as I observed empirically:

  • If the encoding has a 1-byte code space, it interprets the missing code directly as a Unicode code point (basically an identity cast);
  • If the encoding has a 2-byte code space, it decodes the code as an invalid character.

I didn't test other combinations like variable code space size. This helped me to correctly perform text extraction in a PDF with a 1 byte encoding font that was listing only alphabetic characters in the /ToUnicode CMap, leaving out some punctuation which was encoded with the regular ASCII code.
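The observed behavior can be sketched in Python as follows. This is a minimal illustration, not Adobe's implementation: the function name `map_code` and the toy table are hypothetical, and U+FFFD is used as an assumed stand-in for "invalid character".

```python
REPLACEMENT = "\ufffd"  # assumption: representation of an invalid character

def map_code(code, code_len, to_unicode):
    """Look up a code; fall back as described above when it is missing."""
    if code in to_unicode:
        return to_unicode[code]
    if code_len == 1:
        return chr(code)   # 1-byte code space: identity cast to Unicode
    return REPLACEMENT     # 2-byte code space: treated as invalid

# Toy ToUnicode table that lists letters only (invented for illustration).
letters_only = {0x41: "A", 0x42: "B"}

comma_1byte = map_code(0x2C, 1, letters_only)
comma_2byte = map_code(0x002C, 2, letters_only)
```

Here the 1-byte comma code, absent from the table, still comes out as a comma, while the same value in a 2-byte code space does not.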

ceztko