
I am completing a Master's in Data Science and working on a Text Mining assignment. For this project, I intend to download several PDFs from a website; in this case, I want to scrape and save the document called "Prospectus".

Below is the code I am using in Python. The prospectus I wish to download is shown in the screenshot below. However, the script downloads different documents from the web page. Is there something I need to change in my script?

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"

# If there is no such folder, the script will create one automatically
folder_location = r'.\Output'
if not os.path.exists(folder_location): os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

[Screenshot: the Prospectus download link on the page]

    You have not actually looked at the HTML for this page. Look at it with "View Source". That link is NOT PRESENT in the HTML as transmitted, and that's what `requests` sees. That page is generated dynamically using Javascript. You would have to use a real browser, like Selenium, to scrape that. – Tim Roberts Oct 17 '22 at 20:25
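
Tim's point can be checked locally: BeautifulSoup's selector only ever matches anchors present in the static markup it is given, which is exactly what `requests` receives. A minimal sketch with an illustrative HTML snippet (this is made-up markup, not the real page source):

```python
from bs4 import BeautifulSoup

# Stand-in for what requests.get(url).text returns: the static HTML.
# The Prospectus link on the real page is injected later by Javascript,
# so here it is represented by an anchor with no .pdf href.
static_html = """
<html><body>
  <a href="/docs/factsheet.pdf">Factsheet</a>
  <a href="#" data-doc="prospectus">Prospectus</a>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
# The same ends-with attribute selector the question uses:
links = [a["href"] for a in soup.select("a[href$='.pdf']")]
print(links)  # ['/docs/factsheet.pdf']
```

The selector itself is fine; it simply cannot match a link that is not in the HTML the server sent.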

1 Answer


Try:

import re
import requests
import urllib.parse
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
html = requests.get(url).text

# The document list is loaded via an AJAX call; its URL is embedded in the
# page source as a Javascript variable, so extract it with a regex:
ajax_url = (
    "https://www.ishares.com"
    + re.search(r'dataAjaxUrl = "([^"]+)"', html).group(1)
    + "?action=ajax"
)

# The AJAX response is an HTML snippet that contains the document links:
soup = BeautifulSoup(requests.get(ajax_url).content, "html.parser")
prospectus_url = (
    "https://www.ishares.com"
    + soup.select_one("a:-soup-contains(Prospectus)")["href"]
)

# The actual PDF path is carried in the iframeUrlOverride query parameter:
pdf_url = (
    "https://www.ishares.com"
    + urllib.parse.parse_qs(prospectus_url)["iframeUrlOverride"][0]
)

print("Downloading", pdf_url)
with open(pdf_url.split("/")[-1], "wb") as f_out:
    f_out.write(requests.get(pdf_url).content)

Prints:

Downloading https://www.ishares.com/us/literature/prospectus/p-ishares-core-s-and-p-500-etf-3-31.pdf

and saves p-ishares-core-s-and-p-500-etf-3-31.pdf:

-rw-r--r-- 1 root root 325016 okt 17 22:31 p-ishares-core-s-and-p-500-etf-3-31.pdf
Andrej Kesely
  • How did you figure out that the URL next to `dataAjaxUrl` was the one to use, @Andrej? – SIM Oct 17 '22 at 21:17
  • @SIM First I searched for the PDF URL in the source (Ctrl+U; obviously, it isn't there). Then I watched the Network tab for AJAX calls (there was one interesting one), so where does that AJAX URL come from? Back to the first page, Ctrl+U... The response of the AJAX call is another HTML snippet, so parse it... etc. – Andrej Kesely Oct 17 '22 at 21:19
  • Yes, I just noticed it. Thanks. – SIM Oct 17 '22 at 21:20
  • @AndrejKesely Could you please explain in more detail how you watched the Network tab for AJAX calls and what you did after that? I'm completely new to web scraping and have no idea what you are talking about. – Sergo055 May 01 '23 at 16:44
  • @AndrejKesely Btw, what exactly was interesting about that AJAX call that made you decide to go after it? – Sergo055 May 01 '23 at 17:17
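
For readers following the comment thread: the two extraction steps in the answer (a regex for `dataAjaxUrl`, then `parse_qs` for `iframeUrlOverride`) can be sketched on stand-in strings. The snippets below are illustrative, not the real page markup; only the technique matches the answer. This sketch also runs the URL through `urllib.parse.urlsplit` first, a slightly more defensive variant than calling `parse_qs` on the full URL:

```python
import re
import urllib.parse

# Stand-in for the product page source; the real markup differs,
# but it embeds the AJAX path in a Javascript variable the same way.
product_html = 'var dataAjaxUrl = "/us/products/239726/fund.ajax";'

ajax_url = (
    "https://www.ishares.com"
    + re.search(r'dataAjaxUrl = "([^"]+)"', product_html).group(1)
    + "?action=ajax"
)
print(ajax_url)  # https://www.ishares.com/us/products/239726/fund.ajax?action=ajax

# Stand-in for the Prospectus link found in the AJAX response: the real
# PDF path rides along in the iframeUrlOverride query parameter.
prospectus_url = (
    "https://www.ishares.com/us/library?productView=etf"
    "&iframeUrlOverride=/us/literature/prospectus/p-example-3-31.pdf"
)
query = urllib.parse.urlsplit(prospectus_url).query
pdf_path = urllib.parse.parse_qs(query)["iframeUrlOverride"][0]
print(pdf_path)  # /us/literature/prospectus/p-example-3-31.pdf
```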