How can I read a PDF file from inline raw_bytes (not from a file)?

Question

I am trying to create a PDF puller for the Australian Stock Exchange website which will allow me to go through all the 'Announcements' made by companies and search for keywords in the PDFs of those announcements.

So far I am using requests and PyPDF2 to download the PDF file, write it to my drive, and then read it back. However, I want to skip writing the file to my drive and go straight from the downloaded bytes to a string. What I have so far is:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)
my_raw_data = response.content

# Write the downloaded bytes to disk (the step I would like to skip)
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    # Decrypt once, before reading any page
    if read_pdf.isEncrypted:
        read_pdf.decrypt("")
    num_pages = read_pdf.getNumPages()

    ann_text = []
    for page_num in range(num_pages):
        # Collect the words of every page, encrypted or not
        page_text = read_pdf.getPage(page_num).extractText()
        print(page_text)
        ann_text.append(page_text.split())
print(ann_text)

This prints a list of the strings in the PDF file at the URL provided.

Just wondering if I can convert the my_raw_data variable to a readable string directly?

Thanks so much in advance!

... curl it instead of using Python, then read _that_ in? – Mike 'Pomax' Kamermans Jan 01 '21 at 05:17

score 31 · Accepted Answer · edited Dec 25 '22 at 18:09

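You can use io.BytesIO, which wraps the downloaded bytes in an in-memory file-like object that the PDF reader accepts in place of an open file.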

PyPDF2 >= 2.0.0

import requests, PyPDF2, io
from PyPDF2 import PdfReader  # you can also use pypdf>=3.1.0

url = 'http://www.asx.com.au/asxpdf/20171108/pdf/43p1l61zf2yct8.pdf'
response = requests.get(url)

with io.BytesIO(response.content) as open_pdf_file:
    reader = PdfReader(open_pdf_file)
    num_pages = len(reader.pages)
    print(num_pages)

This prints 2.
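To go all the way from the raw bytes to a string, as the question asks, something along these lines might work. It is only a sketch: looks_like_pdf and pdf_bytes_to_text are hypothetical helper names, the empty password is carried over from the question, and the header check is a simple heuristic for catching an HTML error page returned in place of the PDF.

```python
import io


def looks_like_pdf(raw_bytes: bytes) -> bool:
    """Cheap sanity check: PDF files normally begin with the marker %PDF-."""
    return raw_bytes.startswith(b"%PDF-")


def pdf_bytes_to_text(raw_bytes: bytes) -> str:
    """Return the text of every page, joined with newlines (hypothetical helper)."""
    if not looks_like_pdf(raw_bytes):
        raise ValueError("response body does not look like a PDF")

    # Imported here so the helpers above still load where PyPDF2 is not installed.
    from PyPDF2 import PdfReader  # pypdf exposes the same names

    reader = PdfReader(io.BytesIO(raw_bytes))  # no temporary file needed
    if reader.is_encrypted:
        reader.decrypt("")  # assumes an empty user password, as in the question
    # extract_text() may yield nothing for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

With the response from the snippet above, the call would be pdf_bytes_to_text(response.content).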

PS. To open files, always use a context manager (a with statement), so the file is closed even if an error occurs while you are working with it.
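A small illustration of what the with statement does, using only the standard library. The byte string and the read_header name are made up for the example; the point is that the stream is closed when the block is left, whether normally or through an exception.

```python
import io


def read_header(raw_bytes: bytes, size: int = 5) -> bytes:
    """Read the first few bytes; the stream is closed when the block exits."""
    with io.BytesIO(raw_bytes) as stream:
        return stream.read(size)


print(read_header(b"%PDF-1.7\n"))  # b'%PDF-'

# Even when the body of the with block raises, the stream is still closed.
stream = io.BytesIO(b"%PDF-1.7\n")
try:
    with stream:
        raise ValueError("simulated parsing failure")
except ValueError:
    pass

print(stream.closed)  # True
```

The same holds for files opened with open(): without the with statement, an exception between open() and close() would leave the file handle open.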

edited Dec 25 '22 at 18:09

Martin Thoma

answered Nov 08 '17 at 10:29

Maarten Fabré

Why should you always use a context manager? – alias51 Oct 12 '21 at 18:18

DRPK · Answer 2 · 2021-01-01T05:21:35.233

Try this (with the io module and an additional decryption step):

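import requests, PyPDF2, io

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url).content

# Wrap the downloaded bytes in an in-memory file-like object
reserve_pdf_on_memory = io.BytesIO(response)
# PdfFileReader, isEncrypted, getPage and extractText are the legacy PyPDF2 names;
# PyPDF2 >= 2.0.0 calls them PdfReader, is_encrypted, pages and extract_text
load_pdf = PyPDF2.PdfFileReader(reserve_pdf_on_memory)

# Decrypt with an empty password if needed, then print the first page
if load_pdf.isEncrypted:
    load_pdf.decrypt("")
print(load_pdf.getPage(0).extractText())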

Good Luck ... :)
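Once the page text is in memory, the keyword search mentioned in the question could be sketched roughly like this. find_keywords is a hypothetical helper and the sample page strings are invented for illustration; in practice the list would hold the extracted text of each page.

```python
def find_keywords(page_texts, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it."""
    hits = {}
    for keyword in keywords:
        needle = keyword.lower()  # case-insensitive substring match
        hits[keyword] = [
            number
            for number, text in enumerate(page_texts, start=1)
            if needle in text.lower()
        ]
    return hits


# Made-up page texts standing in for text extracted from an announcement.
sample_pages = [
    "Quarterly Activities Report",
    "Dividend announcement and record date",
]
result = find_keywords(sample_pages, ["dividend", "merger"])
print(result)  # {'dividend': [2], 'merger': []}
```

This is a plain substring match, so "dividend" would also match "dividends"; a word-boundary regular expression would be stricter.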
