
I want to save a listed company's announcement from a PDF URL. However, the output file produced by my Python code turns out to be empty.

I also tried to extract the text from the PDF directly. However, the text is in Simplified Chinese, and even UTF-16 cannot completely decode it.

Please help

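import requests
from PyPDF2 import PdfFileReader  # PyPDF2 1.x API

url_pdf = 'http://static.sse.com.cn/disclosure/listedinfo/announcement/c/2018-11-15/601318_20181115_1.pdf'

# Download the announcement and save the raw bytes to disk.
r = requests.get(url_pdf)
r.raise_for_status()  # fail loudly on an HTTP error instead of saving the error page
with open('file_name.pdf', 'wb') as fo:
    fo.write(r.content)

# Re-open the saved file and try to extract text from it.
with open('file_name.pdf', 'rb') as file:
    pdf = PdfFileReader(file)
    info = pdf.getDocumentInfo()
    pages = pdf.numPages
    # Page indices are zero-based, so getPage(1) is the second page.
    print(pdf.getPage(1).extractText())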
Marcus AU
  • Use `pdfminer`; `PyPDF2` cannot load Chinese correctly. – KC. Nov 20 '18 at 10:09
  • Possible duplicate of [How to read pdf file using pdfminer3k?](https://stackoverflow.com/questions/44024697/how-to-read-pdf-file-using-pdfminer3k) – KC. Nov 20 '18 at 10:22
  • A simple copy and paste from that document into another application worked for me, so no code is needed. – Jongware Nov 20 '18 at 18:36
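
The `pdfminer` suggestion in the comments could be sketched roughly as follows. This is a minimal sketch and not a confirmed fix: it assumes the `pdfminer.six` package (a maintained fork of `pdfminer`) is installed, and the helper names `save_pdf`, `download_pdf`, `pdf_to_text` and `main`, as well as the output name `file_name.txt`, are made up for illustration. It also checks that the downloaded bytes really start with the `%PDF` signature before writing them, which may help narrow down why the saved file looked empty.

```python
from pathlib import Path


def save_pdf(content: bytes, path: str) -> int:
    """Write downloaded bytes to disk, refusing anything that is not a PDF."""
    # PDF files begin with the bytes "%PDF"; an empty body or an HTML error
    # page fails this check instead of being saved silently.
    if not content.startswith(b"%PDF"):
        raise ValueError(f"response is not a PDF ({len(content)} bytes received)")
    return Path(path).write_bytes(content)  # number of bytes written


def download_pdf(url: str, path: str) -> int:
    """Fetch url with requests and store it via save_pdf (hypothetical helper)."""
    import requests  # third-party, same library as in the question

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving the error body
    return save_pdf(response.content, path)


def pdf_to_text(path: str) -> str:
    """Extract text with pdfminer.six (assumed installed: pip install pdfminer.six)."""
    from pdfminer.high_level import extract_text

    # The result is an ordinary Python str, so no manual UTF-16 decoding is needed.
    return extract_text(path)


def main() -> None:
    url_pdf = ("http://static.sse.com.cn/disclosure/listedinfo/announcement/"
               "c/2018-11-15/601318_20181115_1.pdf")
    download_pdf(url_pdf, "file_name.pdf")
    text = pdf_to_text("file_name.pdf")
    # Write the text as UTF-8 so the Chinese characters survive on any platform.
    Path("file_name.txt").write_text(text, encoding="utf-8")
```

Calling `main()` would download the announcement and write the extracted text next to it. Whether the Chinese text comes out intact still depends on how the PDF embeds its fonts, so the result should be checked by eye.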
