import os

import requests
from bs4 import BeautifulSoup

desktop = os.path.expanduser("~/Desktop")

url = 'https://www.ici.org/research/stats'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect every anchor whose href mentions an Excel file
excel_files = soup.select('a[href*="xls"]')

for link_tag in excel_files:
    if 'Supplement: Worldwide Public Tables' in link_tag.text:
        link = 'https://www.ici.org' + link_tag['href']
        filename = link_tag['href'].split('/')[-1]
        filepath = os.path.join(desktop, filename)
        # Skip files we have already downloaded
        if os.path.isfile(filepath):
            print('*** File already exists: %s ***' % filename)
            continue
        resp = requests.get(link)
        with open(filepath, 'wb') as output:
            output.write(resp.content)
        print('Saved: %s' % filename)
I am new to web scraping and I want to automatically download a PDF document from a list of websites.
The document is updated monthly, and each update changes its URL on the website, e.g. https://fundcentres.lgim.com/fund-centre/OEIC/Sterling-Liquidity-Fund. I want to download the 'factsheet' PDF from that page. Ideally the code would "click" the factsheet link and save the file to a location on the drive. The difficulty is that the URL changes!
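Since the link text ("Factsheet") stays the same even though the target URL changes each month, one approach is to re-scrape the page on each run, locate the anchor whose text or href mentions the factsheet, and download whatever URL it currently points to. A minimal sketch of that idea follows; the `find_factsheet_url` helper and the sample HTML are my own illustrations, not code from the site, and note that if the LGIM fund centre renders its links with JavaScript, plain `requests` will not see them and you would need a browser-automation tool such as Selenium instead:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_factsheet_url(html, base_url):
    """Return the absolute URL of the first link that looks like a factsheet."""
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        text = a.get_text(strip=True).lower()
        if 'factsheet' in text or 'factsheet' in a['href'].lower():
            # urljoin resolves both relative and absolute hrefs correctly
            return urljoin(base_url, a['href'])
    return None


# Hypothetical snippet standing in for the fund page's HTML
sample = '<a href="/docs/2020-06-factsheet.pdf">Factsheet</a>'
print(find_factsheet_url(sample, 'https://fundcentres.lgim.com'))
```

With the URL in hand, the download step is the same as in the ICI script above: `resp = requests.get(pdf_url)` followed by writing `resp.content` to a file opened in `'wb'` mode. Matching on the link text rather than the href means the script keeps working when only the URL changes.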