Extract toUnicode map from One PDF and use in another

Question

I have a Unicode PDF document which misses the toUnicode map. I have a different PDF with the same font which has the toUnicode map. Can I extract it from one PDF and use it to extract text from the other PDF?

score 6 · Answer 1 · answered Apr 30 '14 at 14:54

For Unicode mapping, PDF has a dedicated resource, /ToUnicode. You can find it in the PDF file inside the font resource dictionary. It looks like this:

<</BaseFont /ONWALI+Sylfaen/DescendantFonts [10 0 R]/Encoding /Identity-H/Subtype /Type0/ToUnicode 11 0 R/Type /Font>>

The /ToUnicode 11 0 R entry is what the PDF file needs to have. 11 0 R is an indirect reference to the object (object number 11, generation 0) that holds the mapping.

I created a sample PDF containing all the alphabet symbols in Acrobat Pro, using the same font as the report, to get a standard ToUnicode mapping. I then extracted the resource as text; it looks something like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
50 beginbfchar
<0003> <0020>
...and so on...
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end

The ToUnicode resource is usually compressed, so you have to decompress it to get text like the above.
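As a rough illustration of the decompression step, here is a minimal Python sketch. It assumes the stream uses the common /FlateDecode filter (zlib) and that you already have the raw stream bytes in hand; the sample CMap, `inflate`, and `parse_bfchar` are made up for the example, and `bfrange` sections are not handled.

```python
import re
import zlib

# Illustrative CMap text standing in for a real ToUnicode stream.
SAMPLE_CMAP = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
endcmap
"""

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)
HEX_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def inflate(raw):
    """Decompress the raw bytes of a stream stored with /Filter /FlateDecode."""
    return zlib.decompress(raw)


def parse_bfchar(cmap):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap):
        for code_hex, uni_hex in HEX_PAIR.findall(block):
            # The destination is UTF-16BE and may hold more than one character.
            text = bytes.fromhex(uni_hex.decode("ascii")).decode("utf-16-be")
            mapping[int(code_hex, 16)] = text
    return mapping


# Simulate the compressed bytes as they would sit inside the PDF.
compressed = zlib.compress(SAMPLE_CMAP)
mapping = parse_bfchar(inflate(compressed))
print(mapping)
```

Having the mapping as a dictionary makes it easy to see which character codes are covered before you try to reuse it elsewhere.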

Then I wrote code that takes the PDF (from a report generated in Microsoft Reporting) and adds a /ToUnicode resource for each font found. A PDF has an xref table with byte offsets pointing to its objects, so you can't edit it as a plain text file. You have to use a PDF engine instead (I used PDFTron, but iText should be enough). This post-processing code runs each time I need to save a report as PDF. Really, the ToUnicode mapping should be written by the Microsoft Reporting engine itself, but that would be too good to be true.
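Attaching the stream to each font is specific to the PDF library you use, so it is not shown here. The part that is independent of the library is producing the CMap text itself. The following is only a sketch of how that text could be generated from a code-to-Unicode table; `build_tounicode_cmap` and its parameters are hypothetical names, the default CMapName and CIDSystemInfo values are copied from the listing above and may need to be changed to suit your font, and the split into blocks of at most 100 entries follows the usual CMap convention.

```python
def build_tounicode_cmap(mapping, cmap_name="Adobe-Identity-UCS",
                         registry="Adobe", ordering="UCS", supplement=0):
    """Return ToUnicode CMap text for a {character code: Unicode string} dict."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo",
        f"<< /Registry ({registry})",
        f"/Ordering ({ordering}) /Supplement {supplement} >> def",
        f"/CMapName /{cmap_name} def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    entries = sorted(mapping.items())
    # Emit bfchar sections in chunks of at most 100 entries each.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are written as UTF-16BE hex.
            dest = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dest}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Illustrative table: code 0x0003 is a space, code 0x0024 is "A".
cmap_text = build_tounicode_cmap({0x0003: " ", 0x0024: "A"})
print(cmap_text)
```

The resulting text would then be stored as a stream object and referenced from the font dictionary's /ToUnicode entry by whatever PDF engine you use.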

That's it.

I'm trying to replicate this with iText, so far I see the ToUnicode stream embedded in the pdf and the /ToUnicode entry added to the /Font dictionary, but when I open the pdf it's still not mapping the characters. Any suggestions? — sq33G, Mar 03 '15 at 12:14
@sq33G Try saving the stream uncompressed so you can check manually how the mapping is stored. I suspect you have not mapped all characters. You can use Adobe Acrobat Pro to validate the PDF; its validation shows the character codes that are missing from the mapping. — oleksa, Mar 03 '15 at 15:40
Awesome! I had the wrong CMapName - it had to match the Registry/Ordering/Supplement of the font. Beautiful! — sq33G, Mar 04 '15 at 08:37

David van Driessche · Answer 2 · 2012-12-02T13:56:35.583

3

The generic answer is no. The ToUnicode map you are talking about follows the PDF CMap format and is used to translate character codes into Unicode values. You face two potential pitfalls:

1) The fonts are not exactly the same. While their names may be the same, they might have a different encoding, or might contain different glyphs (even for the same encoding). In that case applying the CMap from a different font would give you incorrect Unicode values.

2) The fonts may be the same in all aspects but may be subsetted in the PDF file (likely) and the subset may be different. There are certainly cases where that wouldn't change the way the font is stored in the PDF file, but there are optimising PDF writers that will condense anything they can in subsetted fonts, which may give rise to different character codes being used and ultimately different ToUnicode maps.
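One way to test for both pitfalls before reusing a map is to compare it against a handful of code-to-character pairs you have verified by hand in the target file. The sketch below assumes both maps are plain dictionaries keyed by character code; the function names and all data are invented for illustration.

```python
def find_conflicts(donor, observed):
    """Return {code: (donor_text, observed_text)} where the two maps disagree."""
    return {
        code: (donor[code], text)
        for code, text in observed.items()
        if code in donor and donor[code] != text
    }


def missing_codes(donor, used_codes):
    """Return the character codes used in the target file that the donor map lacks."""
    return sorted(set(used_codes) - set(donor))


# Hypothetical donor map taken from the PDF that has a ToUnicode entry.
donor_map = {1: "A", 2: "B", 3: "C"}
# Hypothetical pairs checked by eye in the PDF that lacks one.
observed_map = {1: "B", 2: "A", 4: "D"}

conflicts = find_conflicts(donor_map, observed_map)
missing = missing_codes(donor_map, observed_map.keys())
print(conflicts)
print(missing)
```

Any conflict means the two files assign character codes differently, so the donor map would produce wrong text; missing codes mean the donor subset simply does not cover the target.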

edited Dec 02 '12 at 13:56

answered Dec 02 '12 at 10:21


David, the PDF was generated by Microsoft SSRS. Only one font is used, and I am willing to create a CMap file manually if needed. I have to extract text from thousands of PDFs and can manually try to map character codes to Unicode values. I need a recommendation on code that can help me achieve this. – Naresh Jois Dec 02 '12 at 13:01
Is the font subsetted by Microsoft SSRS? And all files you have are generated by the same application? If the same font is used in all cases and the same generating application is used, I would think it to be worthy of an attempt to simply copy an existing ToUnicode CMap into the other file. – David van Driessche Dec 02 '12 at 14:00
What is the Producer? Is it iTextSharp? (Open Document Properties to find the answer.) If we know the producer, we know what to expect. For instance: if the font is subsetted, your attempt will fail: not all documents need the same characters. – Bruno Lowagie Dec 02 '12 at 15:54
All the documents have been created by the same application. The PDF Producer is reported as Microsoft Reporting Services PDF Rendering Extension 10.0. The PDF version is 1.3. The data in the documents contains all possible options of the font (but the font is an embedded subset). I have a large number of files, so I am willing to put in the effort. – Naresh Jois Dec 02 '12 at 17:09
A stupid question perhaps, but have you tried extracting the text in the files that miss the ToUnicode mapping? Using Adobe Acrobat Pro for example? (I'm not affiliated to Adobe, but Acrobat Pro is an outstanding PDF citizen usually) – David van Driessche Dec 02 '12 at 17:49
@NareshJois would you supply two or three samples to inspect? Depending on how fonts are embedded the character identifiers may be assigned incrementally for each glyph embedded in the order of their first appearance in the document. In that case you will not find a common map to add to all documents. – mkl Dec 02 '12 at 19:13
@DavidvanDriessche : I tried extracting using Adobe Acrobat XI Pro and the copied content comes out mostly as boxes – Naresh Jois Dec 04 '12 at 07:01
@mkl : I have uploaded one file; I will upload more if they are needed : [Skydrive PDF](https://skydrive.live.com/redir?resid=78E5D5E909CB3605!864) – Naresh Jois Dec 04 '12 at 07:12
@NareshJois First of all: For which of the fonts in the file do you want to add toUnicode maps? Already the first page has 5 font resources, 4 subsets of Arial Unicode and one subset of Tunga. And to check whether the toUnicode map you have could work, we'd need the PDF you have with that map (or at least the map itself). – mkl Dec 04 '12 at 09:18
@NareshJois Hi Could you please add a short note how this problem has been solved ? – oleksa Apr 16 '14 at 13:44
@oleksa : I still haven't been able to solve this, (I tried Adobe Acrobat also) , If you solve this please let me know. – Naresh Jois Apr 30 '14 at 10:23
@NareshJois Too much to write in a comment, please find it as an answer below – oleksa Apr 30 '14 at 14:55
@mkl I have a question on this topic. Would be great , if you can view it : [different cmaps for different pdfs, global cmap needed](https://stackoverflow.com/q/57952798/8030107) – aspiring1 Sep 16 '19 at 08:15
