
I am trying to download several PDFs which are located at different hyperlinks within a single URL. My approach was first to retrieve the URLs containing the "fileEntryId" text, which point to the PDFs, according to this link, and secondly to try to download the PDF files using the approach from this link.

This is "my" code so far:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin


http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')

# Create the download folder once, before the loop
folder_location = r'c:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('.*fileEntryId.*'))):
    if link.has_attr('href'):
        x = link['href']

        response = requests.get(x)
        soup = BeautifulSoup(response.text, "html.parser")
        for pdf in soup.select("a[href$='.pdf']"):
            # Name the pdf files using the last portion of each link, which is unique in this case
            filename = os.path.join(folder_location, pdf['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(x, pdf['href'])).content)
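To make the filtering step concrete, here is a minimal, self-contained check of the `fileEntryId` filter on an invented HTML snippet (the real page is far more complex than this):

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Invented snippet standing in for the real page's markup
html = """
<a href="/documents/foo?p_p_id=x&fileEntryId=123">Report A</a>
<a href="/about">About</a>
<a href="/documents/bar?fileEntryId=456">Report B</a>
"""

# SoupStrainer restricts parsing to <a> tags whose href mentions fileEntryId
links = BeautifulSoup(html, 'html.parser',
                      parse_only=SoupStrainer('a', href=re.compile('fileEntryId')))
hrefs = [a['href'] for a in links.find_all('a')]
print(hrefs)
```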

Thank you

Aureon
  • The PDFs are not "embedded". Have you looked at the source code for these pages you're fetching? You are searching for `<embed>` tags, and I don't think there are any `<embed>` tags. The pages are complicated – Tim Roberts Apr 04 '21 at 04:31
  • Hi Tim, thank you for the feedback. I made the changes in the question to avoid misleading solutions. – Aureon Apr 04 '21 at 13:41

1 Answer


Create a folder anywhere and put the script in that folder. When you run the script, you should get the downloaded pdf files within that folder. If for some reason the script doesn't work for you, check whether your bs4 version is up to date, as I've used pseudo CSS selectors (`:contains`) to target the required links.

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    # Some servers reject requests without a browser-like User-Agent
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect the detail-page links whose href contains "fileEntryId"
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        # The "Descargar" (download) anchor on each detail page points at the PDF
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        # Build a filename from the last URL segment, dropping any query string
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
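The file-naming line can be checked in isolation without any network access; the URL below is invented purely for illustration:

```python
# Strip the path and any query string from a download URL to get a filename,
# mirroring the file-naming step in the script above
def pdf_filename(pdf_link):
    return pdf_link.split("/")[-1].split("?")[0]

# Hypothetical URL shaped like a document download link
url = "https://example.com/documents/12345/informe_auditoria?version=1.0"
print(pdf_filename(url))  # informe_auditoria
```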
MITHU