
I am completing a Master's in Data Science and working on a Text Mining assignment. For this project, I intend to download several PDFs from a website; in this case, I want to scrape and save the document called "Prospectus".

Below is the code I am using in Python. The prospectus I wish to download is shown in the screenshot below. However, the script downloads different documents from the web page. Is there something I need to change in my script?

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"

# If there is no such folder, the script will create one automatically
folder_location = r'.\Output'
if not os.path.exists(folder_location): os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

[Screenshot: the Prospectus download link on the page]

    You have not actually looked at the HTML for this page. Look at it with "View Source". That link is NOT PRESENT in the HTML as transmitted, and that's what `requests` sees. That page is generated dynamically using Javascript. You would have to use a real browser, like Selenium, to scrape that. – Tim Roberts Oct 17 '22 at 20:25
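
Tim's point can be checked locally: BeautifulSoup's selector only ever matches anchors present in the static markup it is given, which is exactly what `requests` receives. A minimal sketch with an illustrative HTML snippet (this is made-up markup, not the real page source):

```python
from bs4 import BeautifulSoup

# Stand-in for what requests.get(url).text returns: the static HTML.
# The Prospectus link on the real page is injected later by Javascript,
# so here it is represented by an anchor with no .pdf href.
static_html = """
<html><body>
  <a href="/docs/factsheet.pdf">Factsheet</a>
  <a href="#" data-doc="prospectus">Prospectus</a>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")
# The same ends-with attribute selector the question uses:
links = [a["href"] for a in soup.select("a[href$='.pdf']")]
print(links)  # ['/docs/factsheet.pdf']
```

The selector itself is fine; it simply cannot match a link that is not in the HTML the server sent.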

1 Answer


Try:

import re
import requests
import urllib.parse
from bs4 import BeautifulSoup

url = "https://www.ishares.com/us/products/239726/ishares-core-sp-500-etf"
html = requests.get(url).text

# The document list is loaded via an AJAX call; its URL is embedded in the
# page source as a Javascript variable, so extract it with a regex:
ajax_url = (
    "https://www.ishares.com"
    + re.search(r'dataAjaxUrl = "([^"]+)"', html).group(1)
    + "?action=ajax"
)

# The AJAX response is an HTML snippet that contains the document links:
soup = BeautifulSoup(requests.get(ajax_url).content, "html.parser")
prospectus_url = (
    "https://www.ishares.com"
    + soup.select_one("a:-soup-contains(Prospectus)")["href"]
)

# The actual PDF path is carried in the iframeUrlOverride query parameter:
pdf_url = (
    "https://www.ishares.com"
    + urllib.parse.parse_qs(prospectus_url)["iframeUrlOverride"][0]
)

print("Downloading", pdf_url)
with open(pdf_url.split("/")[-1], "wb") as f_out:
    f_out.write(requests.get(pdf_url).content)

Prints:

Downloading https://www.ishares.com/us/literature/prospectus/p-ishares-core-s-and-p-500-etf-3-31.pdf

and saves p-ishares-core-s-and-p-500-etf-3-31.pdf:

-rw-r--r-- 1 root root 325016 okt 17 22:31 p-ishares-core-s-and-p-500-etf-3-31.pdf
Andrej Kesely
  • How did you figure out that the URL next to `dataAjaxUrl` was the one to use, @Andrej? – SIM Oct 17 '22 at 21:17
  • @SIM First I searched for the PDF URL in the source (Ctrl+U; obviously, it isn't there). Then I watched the Network tab for AJAX calls (there was one interesting one), so where does that AJAX URL come from? Back to the first page, Ctrl+U... The response of the AJAX call is another HTML snippet, so parse it... etc. – Andrej Kesely Oct 17 '22 at 21:19
  • Yes, I just noticed it. Thanks. – SIM Oct 17 '22 at 21:20
  • @AndrejKesely Could you please explain in more detail how you watched the Network tab for AJAX calls and what you did after that? I'm completely new to web scraping and have no idea what you are talking about. – Sergo055 May 01 '23 at 16:44
  • @AndrejKesely Btw, what exactly was interesting about that AJAX call that made you decide to go after it? – Sergo055 May 01 '23 at 17:17
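
For readers following the comment thread: the two extraction steps in the answer (a regex for `dataAjaxUrl`, then `parse_qs` for `iframeUrlOverride`) can be sketched on stand-in strings. The snippets below are illustrative, not the real page markup; only the technique matches the answer. This sketch also runs the URL through `urllib.parse.urlsplit` first, a slightly more defensive variant than calling `parse_qs` on the full URL:

```python
import re
import urllib.parse

# Stand-in for the product page source; the real markup differs,
# but it embeds the AJAX path in a Javascript variable the same way.
product_html = 'var dataAjaxUrl = "/us/products/239726/fund.ajax";'

ajax_url = (
    "https://www.ishares.com"
    + re.search(r'dataAjaxUrl = "([^"]+)"', product_html).group(1)
    + "?action=ajax"
)
print(ajax_url)  # https://www.ishares.com/us/products/239726/fund.ajax?action=ajax

# Stand-in for the Prospectus link found in the AJAX response: the real
# PDF path rides along in the iframeUrlOverride query parameter.
prospectus_url = (
    "https://www.ishares.com/us/library?productView=etf"
    "&iframeUrlOverride=/us/literature/prospectus/p-example-3-31.pdf"
)
query = urllib.parse.urlsplit(prospectus_url).query
pdf_path = urllib.parse.parse_qs(query)["iframeUrlOverride"][0]
print(pdf_path)  # /us/literature/prospectus/p-example-3-31.pdf
```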