PDF Copy Text Issue: Weird Characters

Question

I tried to copy text from a PDF file but got some weird characters. Strangely, Okular can recognize the text, but Sumatra PDF and Adobe cannot; all three applications are installed on Windows 10 64-bit. To better explain my issue, here is a video: https://streamable.com/sw1hc. The "text layer workaround file" is one solution I got. Any help is greatly appreciated. Regards

Thanks a lot. Here is the link to the sample PDF file: [https://send.firefox.com/download/fd4963601a60d0a6/#xoDMk9Eenx52li1dOGNQeQ](https://send.firefox.com/download/fd4963601a60d0a6/#xoDMk9Eenx52li1dOGNQeQ) — ariefcfa, Apr 02 '19 at 17:58
You didn't specify exactly where the weird characters are, but I can see that any ligature in the PDF produces incorrect Unicode output. For example, instead of "Metallgesellschaft fired" I get "Metallgesellschaft ®red", where "fi" is the Unicode ligature U+FB01. The cause is the software that made the PDF: it did not generate a correct ToUnicode map. — Ryan, Apr 02 '19 at 21:07
@Ryan At first I thought the issue affected the whole file, because only Okular recognizes the text. For the specific words mentioned, Sumatra PDF recognizes them as "I???????% ?????????? ????" and Adobe as "I% ". But now I can confirm that Okular also incorrectly recognizes "Metallge-sellschaft fired" as "Metallge-sellschaft ®red". Glad to know where the issue is. Is there any way to correct the ToUnicode map in the PDF? Thanks — ariefcfa, Apr 03 '19 at 08:39

mkl · Accepted Answer · 2019-04-05T10:10:00.047

In short: The (original) PDF does not contain the information required for regular text extraction as described in the PDF specification. Depending on the exact nature of your task, you might try to add the required information to the existing text objects and fonts or you might go for OCR.

Mapping character codes to Unicode as described in the PDF specification

The PDF specification ISO 32000-1 (and similarly ISO 32000-2) describes an algorithm for mapping character codes to Unicode values using information available directly inside the PDF.

It has been quoted very often in other Stack Overflow answers, so I won't quote it here again.

Essentially, this is the algorithm used by Adobe Acrobat during copy and paste, and also by many other text extractors.

In PDFs which don't contain the information required for text extraction, you eventually get to this point in the algorithm:

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
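
As a rough illustration of the lookup order that algorithm follows, here is a minimal Python sketch. It is a simplification, not the specification's wording: the function name, its parameters, and the three-entry glyph table are assumptions made up for this example, and plain dictionaries stand in for the real font structures.

```python
from typing import Dict, Optional

# Tiny illustrative excerpt; the real Adobe Glyph List has thousands of entries.
GLYPH_NAME_TO_UNICODE: Dict[str, str] = {
    "d": "d",
    "fi": "\uFB01",
    "registered": "\u00AE",
}


def code_to_unicode(
    code: int,
    to_unicode: Optional[Dict[int, str]] = None,
    code_to_glyph_name: Optional[Dict[int, str]] = None,
    cid_to_unicode: Optional[Dict[int, str]] = None,
) -> Optional[str]:
    """Hypothetical helper: try each source of Unicode information in turn."""
    # 1. A ToUnicode CMap in the font dictionary is consulted first.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. Simple font: character code -> glyph name -> Unicode via the glyph list.
    if code_to_glyph_name:
        name = code_to_glyph_name.get(code)
        if name in GLYPH_NAME_TO_UNICODE:
            return GLYPH_NAME_TO_UNICODE[name]
    # 3. Composite font with a known character collection: CID -> Unicode.
    if cid_to_unicode and code in cid_to_unicode:
        return cid_to_unicode[code]
    # 4. Nothing inside the PDF identifies the character.
    return None


print(code_to_unicode(0x12, to_unicode={0x12: "d"}))  # d
print(code_to_unicode(0x12))                          # None
```

The `None` case at the end is the point the quoted sentence describes: the PDF itself offers nothing more, so each viewer is on its own.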

What happens if the algorithm above fails to produce a Unicode value

This is where the text extraction implementations differ: they try to determine the matching Unicode value by using heuristics, by using information from beyond the PDF, or by applying OCR to the glyph in question.

That the programs you tried returned such different results shows that

- your PDF does not contain the information required for the algorithm from the PDF specification described above, and
- the heuristics used by those programs differ significantly, with Okular's heuristics working best for your document.

What to do in such a case

There are multiple options, more or less feasible depending on your concrete case:

- Ask the source of the PDF for a version that contains proper information for text extraction.

  Unless you have a contract with that source that requires them to supply the PDFs in machine-readable form, or the source is otherwise obligated to do so, they will usually decline, though...

- Apply OCR to the PDF in question.

  Depending on the quality of the OCR software and the glyphs in the PDF, the results can be of questionable quality; e.g. in your "PDF copy text issue-Text layer workaround.pdf" the header "Chapter 1: Derivative Securities" has been recognized as "Chapter1: Deratve Securites"...

- You can try to interactively add manually created ToUnicode maps to the PDF, e.g. as described by Tilman Hausherr in his answer to "how to add unicode in truetype0font on pdfbox 2.0.0".

  Depending on the number of different fonts you have to create the mappings for, this approach might easily require way too much time and effort...
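
To give an idea of what such a manually created ToUnicode map looks like, here is a minimal Python sketch that only generates the CMap text from a code-to-text table. The function name is made up for this example, it assumes 2-byte character codes, and the sample codes passed in at the end are hypothetical; the real codes must be read from each font in your PDF. Attaching the result to the font's ToUnicode entry still has to be done with a PDF library such as PDFBox.

```python
from typing import Dict


def build_to_unicode_cmap(mapping: Dict[int, str]) -> str:
    """Return the text of a ToUnicode CMap for 2-byte character codes."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # Split into sections of at most 100 entries each.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Source code and destination text are both written as hex;
            # the destination is the text encoded as UTF-16BE.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines)


# Hypothetical codes: 0x12 for "d", 0x30 for the "fi" ligature glyph.
print(build_to_unicode_cmap({0x12: "d", 0x30: "fi"}))
```

Mapping the ligature glyph to the two letters "fi" rather than to U+FB01 is a design choice in this sketch: it makes copied text searchable as plain "fired".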

Many thanks. This explains everything, especially my persistent question of why one program can extract the text while the others can't. — ariefcfa, Apr 03 '19 at 20:29
Could somebody please explain the code in [how to add unicode in truetype0font on pdfbox 2.0.0?](https://stackoverflow.com/questions/39485920/how-to-add-unicode-in-truetype0font-on-pdfbox-2-0-0). I can see glyph character **12** is mapped to unicode hex value **0064**, but as in PDFDebugger screenshot, the latin small letter D has Code/CID/GID **18**. And finally, how to run this code? Is it possible to map it directly in PDFDebugger ToUnicode CMap table? For me, OCR is not an ideal alternative, even via the best route (text layer overlay), because of the text accuracy. — ariefcfa, Apr 05 '19 at 07:49
Apologize for coming back here, I can't comment in the answer section of the mentioned thread above. — ariefcfa, Apr 05 '19 at 07:56
@ariefcfa *"I can see glyph character 12 is mapped to unicode hex value 0064, but as in PDFDebugger screenshot, the latin small letter D has Code/CID/GID 18."* - *All* the mapping data in the **ToUnicode** CMap are written as hexadecimal numbers, including the character codes. Thus, the `<0012>` as decimal number is 18! The PDFDebugger outputs the numbers as decimal numbers. Thus, the numbers you see do match after all! — mkl, Apr 05 '19 at 09:51
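
For anyone who wants to check that arithmetic, a short Python sketch; the variable names are made up for this example:

```python
code_hex = "0012"     # character code as written in the ToUnicode CMap: <0012>
unicode_hex = "0064"  # Unicode value as written in the ToUnicode CMap: <0064>

code = int(code_hex, 16)          # hexadecimal 12 is decimal 18
char = chr(int(unicode_hex, 16))  # U+0064 is the latin small letter d

print(code, char)  # 18 d
```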
@ariefcfa *"how to run this code"* - It's Java code and it's based on the PDFBox library. Thus, you compile it with the PDFBox jars and its dependencies and then run the compiled code. — mkl, Apr 05 '19 at 09:55
@ariefcfa *"Is it possible to map it directly in PDFDebugger ToUnicode CMap table?"* - AFAIK it is not possible. — mkl, Apr 05 '19 at 09:55
*"Apologize for coming back here, I can't comment in the answer section of the mentioned thread above."* - In such a case don't hesitate to create a new stack overflow question in which you reference the old question and answer and ask your questions. — mkl, Apr 05 '19 at 09:58
