I am trying to download several PDFs that are reached through different hyperlinks on a single page. I already asked a similar question here, but this URL has a different structure. The URLs that lead to the PDFs contain the text "p_p_col_count%3D", which is what my selector below keys on, but for some reason it does not work.
There is another solution here, but there the web page has (in my opinion) nicely structured HTML, while the page I am trying to scrape has 12 crammed lines of markup. Moreover, the PDFs on that page can be downloaded from a single link, while in my case I first need to identify the proper URLs and then download from each of them.
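One thing worth noting about the approach: the `[href*='...']` part of a CSS selector is a literal substring match, so it only fires when the attribute contains the encoded text exactly as written. A minimal, self-contained sketch of this behavior (the HTML snippet and hrefs below are made up, only mimicking the table structure the selector assumes):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the selector's expected structure;
# the real page may differ.
html = """
<table class="table">
  <tbody class="table-data">
    <tr><td class="first"><a href="/detail?p_p_col_count%3D1&id=42">Informe 42</a></td></tr>
    <tr><td class="first"><a href="/detail?p_p_col_count=1&id=43">Informe 43</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches only the href containing the percent-encoded text.
encoded = soup.select(
    "table.table > tbody.table-data td.first > a[href*='p_p_col_count%3D']"
)
# A broader substring catches both the encoded and the decoded variant.
broad = soup.select(
    "table.table > tbody.table-data td.first > a[href*='p_p_col_count']"
)

print(len(encoded))  # 1
print(len(broad))    # 2
```

So if the live page serves the decoded form (`p_p_col_count=`), the original selector would return an empty list even though the table rows are there.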
This is "my" code so far:
import requests
from bs4 import BeautifulSoup
link = 'https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor/auditorias-liberadas/sector-infraestructura-fisica-y-telecomunicaciones-comercio-exterior-y-desarrollo-regional/auditorias-liberadas-infraestructura-2019'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect the detail-page links whose href contains the encoded text
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='p_p_col_count%3D']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        # "Descargar" is the download anchor on the detail page
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-2]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
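Another failure mode I considered: if the scraped `href` values are relative, `s.get(inner_link)` would fail because `requests` needs an absolute URL. A stdlib sketch of normalizing them against the page URL (the relative path here is hypothetical):

```python
from urllib.parse import urljoin

page = "https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor"
relative_href = "/documents/informe-42.pdf"  # made-up path for illustration

# urljoin resolves a relative href against the page URL; an
# already-absolute href passes through unchanged.
absolute = urljoin(page, relative_href)
print(absolute)  # https://www.contraloria.gov.co/documents/informe-42.pdf

already_absolute = urljoin(page, "https://example.org/x.pdf")
print(already_absolute)  # https://example.org/x.pdf
```

Wrapping each scraped href in `urljoin(link, href)` before calling `s.get()` makes the loop safe for both cases.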