0

I am trying to download PDF files from this website.

I am new to Python and am currently learning about the software. I have downloaded packages such as urllib and bs4. However, there is no .pdf extension in any of the URLs. Instead, each one has the following format: http://www.smv.gob.pe/ConsultasP8/documento.aspx?vidDoc={.....}.

I have tried to use the soup.find_all command. However, this was not successful.

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib

url="http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")    
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
print(links)
aschultz
  • 1,658
  • 3
  • 20
  • 30
sebasmps
  • 1
  • 1
  • Check my answer. It's pretty similar to your code. I'd say check that `request.urlopen(url).read()` is giving you the expected text. – Sergio Pulgarin Aug 16 '19 at 20:53

1 Answers1

1

This works for me:

import re

import requests
from bs4 import BeautifulSoup

url = "http://www.smv.gob.pe/frm_hechosdeImportanciaDia?data=38C2EC33FA106691BB5B5039DACFDF50795D8EC3AF"
response = requests.get(url).content
soup = BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(http://www.smv.gob.pe/ConsultasP8/documento.aspx?)'))
links = [l['href'] for l in links]
print(links)

Only difference is that I use requests because I'm used to it, and I take the href attribute for each of the returned Tag from BeautifulSoup.

Sergio Pulgarin
  • 869
  • 8
  • 20