I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards
-
Please share the PDF in question. – mkl Apr 02 '19 at 15:31
-
Thanks a lot. Here is the link to the PDF sample file: [https://send.firefox.com/download/fd4963601a60d0a6/#xoDMk9Eenx52li1dOGNQeQ](https://send.firefox.com/download/fd4963601a60d0a6/#xoDMk9Eenx52li1dOGNQeQ) – ariefcfa Apr 02 '19 at 17:58
-
You didn't specify exactly where the weird characters are, but I can see that for any ligatures in the PDF, there is incorrect Unicode output. For example, instead of "Metallgesellschaft ﬁred" I get "Metallgesellschaft ®red", where "ﬁ" is the Unicode ligature U+FB01. This is an issue with the software that made the PDF: it did not generate a correct ToUnicode map in the PDF. – Ryan Apr 02 '19 at 21:07
-
@Ryan At first I thought the issue affected the whole file, because only Okular recognizes the text. For the specific words mentioned, Sumatra PDF recognizes them as "I???????% ?????????? ????" and Adobe as "I% ". But now I can confirm that Okular also incorrectly recognizes "Metallge-sellschaft fired" as "Metallge-sellschaft ®red". Glad to know where the issue is. Is there any way to correct the ToUnicode map in the PDF? Thanks – ariefcfa Apr 03 '19 at 08:39
1 Answer
In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.
Mapping character codes to Unicode as described in the PDF specification
The PDF specification ISO 32000-1 (and similarly ISO 32000-2) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.
It has been quoted very often in other Stack Overflow answers (see here, here, here, here, here, or here), so I won't quote it here again.
Essentially this is the algorithm used by Adobe Acrobat during copy&paste and also by many other text extractors.
In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:
If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
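As a rough illustration of the order of lookups in that algorithm, here is a simplified sketch (not the specification's wording). The function name `code_to_unicode`, the two dictionaries, and the tiny glyph list excerpt are made up for illustration. The composite-font CID lookups are omitted, and falling through when a code is missing from the ToUnicode map is my assumption.

```python
from typing import Optional

# Tiny excerpt of a glyph-name -> Unicode table; the real Adobe Glyph List is far larger.
GLYPH_LIST_EXCERPT = {"d": "d", "fi": "\ufb01", "registered": "\u00ae"}


def code_to_unicode(code: int,
                    to_unicode: Optional[dict],
                    code_to_glyph_name: Optional[dict]) -> Optional[str]:
    """Return the Unicode text for a character code, or None if the data is missing."""
    # Step 1: a ToUnicode CMap in the font dictionary, if present, is used first.
    if to_unicode is not None and code in to_unicode:
        return to_unicode[code]
    # Step 2: otherwise map code -> glyph name (via the font encoding),
    # then glyph name -> Unicode (via the glyph list).
    if code_to_glyph_name is not None:
        name = code_to_glyph_name.get(code)
        if name in GLYPH_LIST_EXCERPT:
            return GLYPH_LIST_EXCERPT[name]
    # Step 3: no Unicode value could be determined from the PDF itself.
    return None
```

A `None` result corresponds to the situation in the quoted sentence: the PDF itself offers nothing more, and each reader is on its own.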
What happens if the algorithm above fails to produce a Unicode value
This is where the text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, information from beyond the PDF, or OCR applied to the glyph in question.
That the different programs you tried returned such different results shows that
- your PDF does not contain the information required for the algorithm above from the PDF specification, and
- the heuristics used by those programs differ considerably, and Okular's heuristics work best for your document.
What to do in such a case
There are multiple options, more or less feasible depending on your concrete case:
- Ask the source of the PDF for a version that contains proper information for text extraction.
  Unless you have a contract with that source that requires them to supply the PDFs in a machine readable form or the source is otherwise obligated to do so, they usually will decline, though...
- Apply OCR to the PDF in question.
  Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of a questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...
- You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".
  Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
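To give an idea of what such a manually created ToUnicode map contains, here is a minimal sketch that reads the `beginbfchar` ... `endbfchar` entries of a CMap fragment. The fragment and the helper name `parse_bfchar` are illustrative assumptions: the `<0012> <0064>` entry is the one from the linked PDFBox answer, while the `<00AE> <FB01>` entry is a hypothetical mapping for an "ﬁ" ligature, not taken from your file. Both the character code and the target value are written in hexadecimal, and the target is UTF-16BE.

```python
import re

# Illustrative fragment of a ToUnicode CMap; only the bfchar section is shown.
CMAP_FRAGMENT = """
2 beginbfchar
<0012> <0064>
<00AE> <FB01>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict:
    """Map character codes (as ints) to Unicode strings from bfchar sections."""
    mapping = {}
    # Each bfchar section lists pairs: <source code> <destination in UTF-16BE hex>.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


if __name__ == "__main__":
    for code, text in parse_bfchar(CMAP_FRAGMENT).items():
        print(f"code {code} (hex {code:04X}) -> U+{ord(text):04X}")
```

A real ToUnicode CMap also has a header, a codespace range, and possibly `bfrange` sections, which this sketch ignores. Such a map has to be written for every affected font, which is where the effort comes from.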

-
Many thanks. This explains everything, especially my persistent question of why one program can extract the text while other programs can't. – ariefcfa Apr 03 '19 at 20:29
-
Could somebody kindly elaborate on the code in [how to add unicode in truetype0font on pdfbox 2.0.0?](https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0). I can see glyph character **12** is mapped to unicode hex value **0064**, but as in PDFDebugger screenshot, the latin small letter D has Code/CID/GID **18**. And finally, how to run this code? Is it possible to map it directly in PDFDebugger ToUnicode CMap table? For me, OCR as another solution is not ideal, even via the best route (text layer overlay), because of the text accuracy. – ariefcfa Apr 05 '19 at 07:49
-
Apologize for coming back here, I can't comment in the answer section of the mentioned thread above. – ariefcfa Apr 05 '19 at 07:56
-
@ariefcfa *"I can see glyph character 12 is mapped to unicode hex value 0064, but as in PDFDebugger screenshot, the latin small letter D has Code/CID/GID 18."* - *All* the mapping data in the **ToUnicode** CMap are written as hexadecimal numbers, including the character codes. Thus, the `<0012>` as decimal number is 18! The PDFDebugger outputs the numbers as decimal numbers. Thus, the numbers you see do match after all! – mkl Apr 05 '19 at 09:51
-
@ariefcfa *"how to run this code"* - It's Java code and it's based on the PDFBox library. Thus, you compile it with the PDFBox jars and its dependencies and then run the compiled code. – mkl Apr 05 '19 at 09:55
-
@ariefcfa *"Is it possible to map it directly in PDFDebugger ToUnicode CMap table?"* - AFAIK it is not possible. – mkl Apr 05 '19 at 09:55
-
*"Apologize for coming back here, I can't comment in the answer section of the mentioned thread above."* - In such a case don't hesitate to create a new stack overflow question in which you reference the old question and answer and ask your questions. – mkl Apr 05 '19 at 09:58
-