Download PDFs with Python, pt. 2

Question

I am trying to download several PDFs that are reachable through different hyperlinks on a single page. I already asked a similar question here, but this URL has a different structure. The URLs that lead to the PDFs contain the text "p_p_col_count%3D", which I use in the selector in my code, but for some reason it does not work.

There is another solution here, but in that case the web page has (in my opinion) nicely structured HTML, while the page I am trying to scrape has 12 crammed lines of code. Moreover, the PDFs on that page can each be downloaded from a single link, while in my case you need to identify the proper URLs first and then download the files.

This is "my" code so far:

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor/auditorias-liberadas/sector-infraestructura-fisica-y-telecomunicaciones-comercio-exterior-y-desarrollo-regional/auditorias-liberadas-infraestructura-2019'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='p_p_col_count%3D']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text,"lxml")
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-2].split("/")[-1]
        with open(f"{file_name}.pdf","wb") as f:
            f.write(s.get(pdf_link).content)

Best regards

Does this answer your question? https://stackoverflow.com/questions/39237311/downloading-pdfs-from-links-scraped-with-beautiful-soup — programandoconro, Apr 13 '21 at 02:05
Does this answer your question? [Downloading PDFs from links scraped with Beautiful Soup](https://stackoverflow.com/questions/39237311/downloading-pdfs-from-links-scraped-with-beautiful-soup) — Ken White, Apr 13 '21 at 02:43
Hi, I replaced the URL and it did not download the documents; the HTML structure is quite different. I am not an HTML expert, but the UK Companies House site is nicely coded compared with the page I am after. — Aureon, Apr 14 '21 at 02:08

baduker · Accepted Answer · 2021-04-14T07:59:31.830

You have some issues with the CSS selectors. There's also room to improve the handling of the file names, as they are not easy to unify.

You might want to try this:

import re
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor/auditorias-liberadas/sector-infraestructura-fisica-y-telecomunicaciones-comercio-exterior-y-desarrollo-regional/auditorias-liberadas-infraestructura-2019'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    soup = BeautifulSoup(s.get(link).text, "lxml")
    # Collect the detail-page URL of every audit listed on the index page.
    follow_links = [
        anchor["href"] for anchor
        in soup.select(".aui .asset-abstract .asset-content .asset-more a")
    ]

    for follow_link in follow_links:
        # Each detail page holds a single download link to the PDF.
        soup = BeautifulSoup(s.get(follow_link).text, "lxml")
        pdf_link = soup.select_one(
            ".aui .view .lfr-asset-column-details .download-document a"
        ).get("href")
        pdf_response = s.get(pdf_link)
        # The server sends the file name in the Content-Disposition header.
        pdf_name = pdf_response.headers["Content-Disposition"]
        # Keep the text after the first three-digit run, decode the
        # percent-escapes, join the words with underscores, drop stray quotes.
        file_name = "_".join(
            unquote(
                re.split(r"\d{3}", pdf_name, maxsplit=1)[-1]
            ).split()
        ).replace('"', "")
        print(f"Fetching {file_name}")
        with open(file_name, "wb") as f:
            f.write(pdf_response.content)

Output:

Fetching Actuación_Especial_Contrato_de_Concesión_del_Aeropuerto_El_Dorado.pdf
Fetching ACTUACION_ESPECIAL_DE_FISCALIZACI+ôN_FONDO_DE_ADAPTACION-PUENTE_HISGAURA_MALAGA_LOS_CUROS.pdf
Fetching ACTUACION_ESPECIAL_DE_CONTROL_FISCAL_SERVICIOS_POSTALES_NACIONALES_S.A._472.pdf
Fetching Actuación_Especial_de_Control_Fiscal_-Convenios_suscritos_por_la_Agencia_Nacional_Inmobiliaria_Virgilio_Barco_Vargas.pdf
Fetching Cumplimiento_Superintendencia_de_Transporte.pdf
Fetching Cumplimiento_ANI-_Corredor_Vial_Bogota-Villavicencio.pdf
Fetching Financiera_Cámara_de_Comercio_de_Armenia_y_del_Quind+¡o.pdf
...
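
Since the file names come from the `Content-Disposition` header and are not easy to unify, here is a minimal sketch of a more defensive way to build them, using only the standard library. The helper name `safe_pdf_name`, the fallback name `download.pdf`, and the sample header values are assumptions made for illustration; the exact headers this site sends are not shown here. Unlike the code above, the sketch keeps the whole file name instead of splitting on the first three-digit run.

```python
import re
from email.message import Message
from urllib.parse import unquote

# Characters that Windows rejects in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_pdf_name(content_disposition, fallback="download.pdf"):
    """Build a file-system-safe name from a Content-Disposition header value.

    `safe_pdf_name` and `fallback` are hypothetical names for this sketch.
    """
    if not content_disposition:
        return fallback

    # Let the standard library parse the header parameters.
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    raw_name = msg.get_filename()
    if not raw_name:
        return fallback

    name = unquote(raw_name)              # decode percent-escapes such as %20
    name = _INVALID_CHARS.sub("", name)   # drop quotes and other invalid characters
    name = "_".join(name.split())         # collapse whitespace into underscores
    return name or fallback
```

Inside the download loop it could be used as `file_name = safe_pdf_name(pdf_response.headers.get("Content-Disposition"))`, which also avoids a `KeyError` when the header is missing.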

Hi, thank you for your help. I managed to download the first 2 PDFs, but there was an error here `---> 26 with open(file_name, "wb") as f:` with this document: `OSError: [Errno 22] Invalid argument: 'ACTUACION_ESPECIAL_DE_CONTROL_FISCAL_SERVICIOS_POSTALES_NACIONALES_S.A._472.pdf"'`. I know that you managed to download the documents, so it is probably an encoding issue on my side. — Aureon, Apr 14 '21 at 02:24
This probably has something to do with the `"` in the file name. Those file names are terribly formatted. I've updated the answer, try now. — baduker, Apr 14 '21 at 08:00
Thank you, I managed to download the documents. Regarding the web page, I guess the HTML code is also a mess. — Aureon, Apr 14 '21 at 12:31
