How to extract Simplified Chinese text from a remote PDF file (given by URL) in Python? The output of my code turns out to be empty

Asked Nov 20 '18 at 07:53

Active Nov 20 '18 at 07:53

Viewed 525 times

I want to save a listed company's announcement from a PDF URL. However, the output of my Python code turns out to be empty.

I tried to extract the text from the PDF directly, but the text is Simplified Chinese and even decoding it as UTF-16 does not recover it completely.

Please help.

import requests
from PyPDF2 import PdfFileReader

url_pdf = 'http://static.sse.com.cn/disclosure/listedinfo/announcement/c/2018-11-15/601318_20181115_1.pdf'

# Download the PDF and save it locally; fail early on an HTTP error.
r = requests.get(url_pdf)
r.raise_for_status()
with open('file_name.pdf', 'wb') as fo:
    fo.write(r.content)

with open('file_name.pdf', 'rb') as file:
    pdf = PdfFileReader(file)
    info = pdf.getDocumentInfo()
    pages = pdf.numPages
    # Page indices are zero-based, so getPage(1) is the second page.
    print(pdf.getPage(1).extractText())

Marcus AU

Use `pdfminer`; `PyPDF2` cannot load Chinese correctly. – KC. Nov 20 '18 at 10:09
Possible duplicate of [How to read pdf file using pdfminer3k?](https://stackoverflow.com/questions/44024697/how-to-read-pdf-file-using-pdfminer3k) – KC. Nov 20 '18 at 10:22
A simple copy and paste from that document into another application worked for me, so no code is needed. – Jongware Nov 20 '18 at 18:36
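
For reference, the `pdfminer` suggestion in the comments could look roughly like the sketch below. It assumes the `pdfminer.six` package and its `pdfminer.high_level.extract_text` function, and it swaps `requests` for the standard-library `urllib.request` to keep dependencies down. The helper names (`is_pdf`, `download`, `extract_text_from_pdf_bytes`) are made up for illustration, and whether this recovers all of the Chinese text from this particular PDF has not been verified.

```python
import io
import urllib.request


def is_pdf(data: bytes) -> bool:
    """Cheap sanity check: a PDF normally begins with the marker %PDF-."""
    return data.lstrip().startswith(b"%PDF-")


def download(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the raw bytes behind a URL (raises on HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_text_from_pdf_bytes(data: bytes) -> str:
    """Extract text from in-memory PDF bytes with pdfminer.six."""
    if not is_pdf(data):
        # An empty result is often caused by saving an error page, not a PDF.
        raise ValueError("content does not look like a PDF")
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    # Imported here so the helpers above work without it.
    from pdfminer.high_level import extract_text

    # extract_text accepts a path or a binary file-like object.
    return extract_text(io.BytesIO(data))
```

With these helpers, something like `text = extract_text_from_pdf_bytes(download(url_pdf))` would return a Python `str`. When saving it, open the output file with an explicit encoding, for example `open('out.txt', 'w', encoding='utf-8')`, so the Chinese characters are not lost or rejected by the platform's default codec.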
