I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text.
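For reference, this is roughly what I'm doing with pdfminer (the pdfminer.six fork under Python 3); the folder and file names here are just placeholders:

    from pathlib import Path
    from pdfminer.high_level import extract_text

    # Loop over a folder of PDFs and dump each one's text to a UTF-8 .txt file.
    for pdf_path in Path("pdfs").glob("*.pdf"):
        text = extract_text(str(pdf_path))  # works fine for most files, garbled for a few
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")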
If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊ݢቹៜϐѦჾѱ॥ᓀϩӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste it. I thought maybe it was a font issue; the font is DFKaiShu sb-estd-bf, which I already have installed (it appears to come with Windows automatically).
I do have a workaround, but it's ugly and very difficult to automate: I print the PDF to a new PDF (or any sort of image), then use Adobe Pro's built-in OCR, then convert to a Word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
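I haven't been able to script the Adobe steps at all. If I were to automate something roughly equivalent outside Acrobat, I imagine it would look something like the sketch below, using pdf2image and pytesseract (which require poppler and Tesseract with the Chinese language data installed); this is only a guess at an alternative, and I haven't verified it does any better on these particular files:

    # Hypothetical rasterize-then-OCR pipeline, not my actual Adobe workflow.
    from pdf2image import convert_from_path
    import pytesseract

    def ocr_pdf(pdf_path, lang="chi_tra"):  # "chi_sim" for simplified Chinese
        pages = convert_from_path(pdf_path, dpi=300)  # render each page to an image
        return "\n".join(pytesseract.image_to_string(p, lang=lang) for p in pages)

    print(ocr_pdf("sample.pdf"))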
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to Google Drive here.