Getting Chinese text from pdf, font encoding issue

Question

I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from a large number of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:

!! 
F/.....e..................! 
216.. ..... .... .... 
........

If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:

ҁϦљӢख़ε༊౗ݢ୏ቹៜϐѦჾѱ൑॥ᓀϩ݋ӵΠ

I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought it might be a font issue: the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).

I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a PDF (or any sort of image), then use Adobe Acrobat Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.

Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.
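
Since only certain files are affected, one way to limit the OCR workaround to those files is to check how much of each file's extracted text is actually Chinese. The following is a minimal sketch rather than a tested pipeline: `cjk_ratio`, `looks_garbled`, and the 0.5 threshold are assumptions made for illustration, and the text is assumed to have been extracted already (for example with pdfminer).

```python
def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(
        1
        for c in chars
        # CJK Unified Ideographs and CJK Extension A
        if "\u4e00" <= c <= "\u9fff" or "\u3400" <= c <= "\u4dbf"
    )
    return cjk / len(chars)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Flag text whose share of CJK ideographs falls below the threshold.

    The threshold is an assumed starting point, not a measured value.
    Empty text is also flagged, because its ratio is 0.0.
    """
    return cjk_ratio(text) < threshold


# The intended first line of the sample versus what Acrobat exported to .txt.
expected = "E.本公司因重大匯率波動影響之外幣市場風險分析如下:"
exported = "F/.....e..................!"
needs_ocr = [name for name, text in [("good", expected), ("bad", exported)]
             if looks_garbled(text)]
```

Pages that consist mostly of digits or tables, which are common in financial reports, could fall below the threshold even when extraction is correct, so the cutoff would need tuning on real files.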

*"I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste."* - Because a PDF document may contain the correct information to draw a specific embedded anonymous glyph for a given character code and at the same time may miss the information which Unicode code point corresponds to that character code or glyph. — mkl, Nov 13 '18 at 11:00
Getting Unicode right in PDFs is complicated. See [@dredkin's answer on Unicode in PDF](https://stackoverflow.com/a/36820254/240443) for details. It's likely your PDFs are missing the /ToUnicode mapping. Any ideas how the PDFs were produced? — Amadan, Nov 13 '18 at 11:05
Actually your example PDF does contain information which Unicode code point corresponds to a character code or glyph (a **ToUnicode** table for each font), but *these mappings are incorrect*. You could say your PDF *lies* to text extractors. — mkl, Nov 13 '18 at 11:09
@Leo - those examples were all from the first line of the pdf, which should be `E.本公司因重大匯率波動影響之外幣市場風險分析如下:` @Amadan - these are public financial reports for companies, I personally can't guess at how any particular pdf was produced. @mkl - is there any way to correct this? Also, i'm not sure I understand completely - how does adobe know which character to draw but also not know which unicode character it is? why are these two not synonymous? — tigerninjaman, Nov 13 '18 at 12:11
@tigerninjaman I meant, add your code snippet which is throwing error. — Hayat, Nov 13 '18 at 13:14
I think you misunderstood. There is no code throwing an error, the built-in adobe `export` or even copy-paste are not functioning correctly due to a bad /ToUnicode map. — tigerninjaman, Nov 14 '18 at 01:50
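
To illustrate the point made in the comments: a viewer draws glyphs by looking up character codes in the embedded font, while a text extractor translates the same codes through the separate ToUnicode CMap, so the page can render correctly even when that map is wrong. The sketch below is a simplified illustration, not a full CMap parser. It handles only `beginbfchar` sections (not `bfrange`), and the character codes and both CMap fragments are invented for the example rather than taken from the sample file.

```python
import re

# One <source code> <UTF-16BE destination> pair inside a bfchar section.
BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict:
    """Build {character code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in BFCHAR_PAIR.findall(block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def extract(codes, mapping) -> str:
    """Translate character codes to text the way an extractor would."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Hypothetical map that "lies": the codes point at unrelated code points.
wrong_cmap = """
2 beginbfchar
<0001> <0481>
<0002> <03E6>
endbfchar
"""

# Hypothetical corrected map for the same two codes.
fixed_cmap = """
2 beginbfchar
<0001> <672C>
<0002> <516C>
endbfchar
"""

codes_on_page = [0x0001, 0x0002]  # the glyphs drawn for these codes are unaffected
garbled = extract(codes_on_page, parse_bfchar(wrong_cmap))
repaired = extract(codes_on_page, parse_bfchar(fixed_cmap))
```

With a wrong map the extractor has nothing else to go on, which is why export and copy-paste fail in the same way. Correcting the map requires learning the true code-to-character correspondence from another source, such as OCR of the rendered page.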

