
I'm trying to write a script that iterates through a list of landing page URLs from a CSV file, appends every PDF link found on each landing page to a list, and then iterates through that list, downloading the PDFs to a specified folder.

I'm a bit stuck on the last step: I can get all the PDF URLs, but can only download them individually. I'm not sure how best to vary the file path for each URL so that every PDF gets its own unique file name.

Any help would be appreciated!

from bs4 import BeautifulSoup
import requests

# example landing page url
url = "https://beta.companieshouse.gov.uk/company/00445790/filing-history"
link_list = []
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

# collect the document links on the landing page
for a in soup.find_all('a', href=True):
    if "document" in a['href']:
        link_list.append("https://beta.companieshouse.gov.uk" + a['href'])

# this is where I'm stuck: every PDF is written to the same path,
# so each download overwrites the previous one
for url in link_list:
    response = requests.get(url)
    with open('C:/Users/Desktop/CompaniesHouse/report.pdf', 'wb') as f:
        f.write(response.content)
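
For completeness, the first step that feeds the landing pages in from the CSV looks roughly like this (the file path and the assumption that the URLs sit in the first column are just placeholders):

import csv
import requests
from bs4 import BeautifulSoup

link_list = []
with open('C:/Users/Desktop/CompaniesHouse/landing_pages.csv', newline='') as csv_file:
    for row in csv.reader(csv_file):
        landing_url = row[0]  # placeholder: assumes the URL is in the first column
        r = requests.get(landing_url)
        soup = BeautifulSoup(r.content, "lxml")
        # same link collection as above, once per landing page
        for a in soup.find_all('a', href=True):
            if "document" in a['href']:
                link_list.append("https://beta.companieshouse.gov.uk" + a['href'])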
hlbau

1 Answer


The simplest thing is to just add a number to each filename using enumerate:

for ind, url in enumerate(link_list, 1):
    response = requests.get(url)

    with open('C:/Users/Desktop/CompaniesHouse/report_{}.pdf'.format(ind), 'wb') as f:
        f.write(response.content)
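
Starting enumerate at 1 means the files come out as report_1.pdf, report_2.pdf and so on, in the same order as link_list.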

But presuming each path ends in some_filename.pdf and the names are unique, you can use the basename itself, which may be more descriptive:

from os.path import basename, join

for url in link_list:
    response = requests.get(url)
    with open(join('C:/Users/Desktop/CompaniesHouse', basename(url)), 'wb') as f:
        f.write(response.content)
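
Whichever naming scheme you use, it may also be worth making sure the target folder exists and skipping any failed requests before writing; a rough sketch along those lines, reusing the folder path from the question:

import os
from os.path import basename, join

out_dir = 'C:/Users/Desktop/CompaniesHouse'
os.makedirs(out_dir, exist_ok=True)  # create the folder if it isn't there yet

for url in link_list:
    response = requests.get(url)
    if not response.ok:  # skip broken links rather than saving an error page
        continue
    with open(join(out_dir, basename(url)), 'wb') as f:
        f.write(response.content)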
Padraic Cunningham