5

I am trying to create a pdf puller from the Australian Stock Exchange website which will allow me to search through all the 'Announcements' made by companies and search for key words in the pdfs of those announcements.

So far I have used the requests library. Here is my code:

import requests

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url)

print(response.content)

However, what prints is the following bytes string (cut off here, as it is too long):

> b'%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n5 0 obj\r<</E 212221/H [ 1081 145 ]/L
> 212973/Linearized 1/N 1/O 8/T 212553>>\rendobj\r                      
> \r\r42 0 obj\r<</DecodeParms <</Columns 5/Predictor 12>>/Encrypt 7 0
> R/Filter /FlateDecode/ID [(\\216\\203\\217T\\n\\f\\236\\345?%\\214t4
> E\\271) (\\216\\203\\217T\\n\\f\\236\\345?%\\214t4 E\\271)]/Index [5
> 38]/Info 3 0 R/Length 86/Prev 212554/Root 6 0 R/Size 43/Type /XRef/W
> [1 3
> 1]>>\rstream\nx\x9ccbd`\x10``b``:\x04"\x19\xab\xc1d-X\xc4\x06D2\xac\x02\xb3\x93\xc0\xe2\x1d
> \x92?\x07,\x1e\t"\xb9T\x80$\xe3\x84\xcb@\x92\xa9m"\x03\x13\xe3\xdf\x13Z`Y\x06\xc6\x01#\xff3\xb0h\xbcfb`\xb6\x12\x02\xba\xe4\xef!S\x06\x0

I have searched Stack Exchange and other websites for a few days, and have tried print(response.content.decode('utf-8')) as well as decoding as ascii, but neither of them produces anything I can read.

Apologies as I know it is obvious that I am a noobie, but any help would be greatly appreciated!

Thanks a lot.

James Ward

3 Answers

9

A PDF is a binary file, so you have to read it according to its format, with its headers and footers. You cannot read binary files as a raw string.

1) If you have ANY spaces in your file name, the PyPDF2 decrypt function will ultimately fail despite returning a success code. Stick to underscores when naming your PDFs before you run them through PyPDF2.

For example, rather than "my pdf.pdf", use something like "my_pdf.pdf".

2) Try decrypting it with an empty string as the password; that works for this file.

Try this:

import requests, PyPDF2

url = 'http://www.asx.com.au/asxpdf/20171103/pdf/43nyyw9r820c6r.pdf'
response = requests.get(url)
my_raw_data = response.content

# Save the downloaded bytes to disk in binary mode
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)

# Written for PyPDF2 < 3.0 (PdfFileReader / getPage / extractText)
with open("my_pdf.pdf", 'rb') as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    if read_pdf.isEncrypted:
        # The file is encrypted, but with an empty password
        read_pdf.decrypt("")
    print(read_pdf.getPage(0).extractText())
DRPK
  • This is amazing! Thank you so much. How would I convert the my_raw_data variable to a string to then manipulate without having to write it to a PDF and then read it? Or is this not possible? Thanks a lot. – James Ward Nov 08 '17 at 09:07
  • @JamesWard: you want to extract contents from your pdf as string? and save that in a txt file? – DRPK Nov 08 '17 at 09:11
  • @JamesWard: just change print(read_pdf.getPage(0).extractText()) to blabla = read_pdf.getPage(0).extractText(). Now you have a string variable and you can do anything with it! Could you tick my answer, please? :) – DRPK Nov 08 '17 at 09:22
  • But I still have a pdf file on my computer. I am wondering if I can go from my_raw_data to a string without having to write it to a pdf file, reading the pdf file and then deleting the pdf file (as I want to do this thousands of times). Thanks! Ticked you :) – James Ward Nov 08 '17 at 09:28
  • @JamesWard: I recommend you ask a new question about "how can I read a pdf file from inline raw_bytes (not from file)". – DRPK Nov 08 '17 at 09:38
  • I posted it here https://stackoverflow.com/questions/47177060/how-can-i-read-a-pdf-file-from-inline-raw-bytes-not-from-file – James Ward Nov 08 '17 at 10:19
  • @JamesWard: i answered you.. check it :) – DRPK Nov 08 '17 at 10:33
  • __WARNING: Breaking changes in version 3__: As of Jan 2023, this code breaks as PyPDF2 was upgraded to a new major version: 3.0.0. Replace `PyPDF2.PdfFileReader` with `PyPDF2.PdfReader` and `read_pdf.getPage(0).extractText()` with `read_pdf.pages[0].extract_text()` – Nicolas Dao Mar 19 '23 at 06:41
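
On the follow-up in the comments about skipping the temporary file: one possible approach is to wrap the downloaded bytes in `io.BytesIO` and hand that to the reader. The sketch below assumes PyPDF2 3.x (the renamed API from the last comment above) and has not been run against the ASX file. The helper names `pdf_bytes_to_text` and `find_keywords` are made up for illustration.

```python
import io


def pdf_bytes_to_text(pdf_bytes):
    """Extract text from PDF bytes held in memory (no temporary file)."""
    # Imported here so find_keywords() stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 3.0

    reader = PdfReader(io.BytesIO(pdf_bytes))
    if reader.is_encrypted:
        # Empty password, as suggested in the answer above.
        reader.decrypt("")
    # Fall back to "" in case a page yields no text at all.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def find_keywords(text, keywords):
    """Return the keywords that occur in text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]
```

Usage would look something like `text = pdf_bytes_to_text(response.content)` followed by `find_keywords(text, ["dividend", "acquisition"])`, where the keyword list is only an example.
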
0

That response is the encoded byte string representing the contents of the PDF. You need to use an extraction tool such as pdfminer to get the text out of it. There is an example on the pdfminer page showing how to do a sample extraction via Python.
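
As a rough sketch of that idea, assuming the pdfminer.six package and its `pdfminer.high_level.extract_text` function, the downloaded bytes can be checked for the `%PDF-` header and then passed to the extractor from memory. The function names `looks_like_pdf` and `pdfminer_text_from_bytes` are hypothetical, and this has not been tested against the ASX file.

```python
import io


def looks_like_pdf(data):
    """True if the bytes start with the PDF header, e.g. b'%PDF-1.5'."""
    return data.startswith(b"%PDF-")


def pdfminer_text_from_bytes(data):
    """Extract text with pdfminer.six from PDF bytes held in memory."""
    # Assumes pdfminer.six is installed; imported lazily so that
    # looks_like_pdf() can be used without it.
    from pdfminer.high_level import extract_text

    if not looks_like_pdf(data):
        raise ValueError("response body does not look like a PDF")
    # password="" is meant for files encrypted with an empty password.
    return extract_text(io.BytesIO(data), password="")
```

With the question's code this would be called as `pdfminer_text_from_bytes(response.content)`.
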

  • I have tried using PyPDF2 however it says that the pdf is encrypted and I am unsure what the password could be. This is my code: `import PyPDF2 pdfobj = open('test.pdf', 'rb') pdfreader = PyPDF2.PdfFileReader(pdfobj) #pageobj = pdfreader.getPage(0) print(pdfreader.isEncrypted)` This returns true but I am unsure what the password could be. – James Ward Nov 08 '17 at 04:01
  • Can you open the "test.pdf" on another application? If not, then it is probably password encrypted. In this case, without knowing the password or the encryption algorithm used, it would be non-trivial to open. – Sohail Khan Nov 08 '17 at 06:09
  • I can open it on my computer without a password. I'm quite confused. Thank you for your quick responses. – James Ward Nov 08 '17 at 08:03
0

You can simply paste a URL into a shell script (a Windows batch file here), as I did with that address, although it could just as well be a list of addresses.


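@echo off&Title PDF URL TO TXT&Color 9F
rem Take the URL from the first argument, otherwise prompt for it
if not "%~1"=="" set "URL=%~1"
if "%~1"=="" set /p "URL=URL ? "

rem Download the PDF into the temp folder
curl -o "%temp%\temp.pdf" "%URL%"
timeout 5
rem Open the PDF in the default viewer
"%temp%\temp.pdf"
rem Convert to text; pdftotext writes temp.txt next to temp.pdf
"C:\Apps\PDF\poppler\23.01.0\Library\bin\pdftotext.exe" -layout -nopgbrk -enc UTF-8 "%temp%\temp.pdf"
notepad "%temp%\temp.txt"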


There are several ways to paste the URL in (I copied it from the question, then clicked the place where I keep handy drop-down commands), and many ways to parse the resulting text file to find a word, but the simplest is to find, cut and paste from the file now open in Notepad.

Clearly this will not work for a minority of more secured target sites, but it should for the majority of conventional PDF URLs.
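
Since the question is in Python, the same pdftotext call could also be driven from a script. This is only a sketch: it assumes a Poppler `pdftotext` executable is reachable (pass the full path as `exe` otherwise), uses `-` as the output name so the text goes to stdout instead of a .txt file, and the function names are invented for the example.

```python
import subprocess


def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the pdftotext call from the batch file; '-' sends text to stdout."""
    return [exe, "-layout", "-nopgbrk", "-enc", "UTF-8", pdf_path, "-"]


def pdf_file_to_text(pdf_path, exe="pdftotext"):
    """Run pdftotext on a saved PDF and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, exe),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Looping `pdf_file_to_text` over a list of downloaded files would cover the "list of addresses" case.
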

K J