
I'm trying to download a bunch of PDF files from here using requests and beautifulsoup4. This is my code:

import requests
from bs4 import BeautifulSoup as bs

_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')

    for x in range(i):
        output = open('file[%d].pdf' % x, 'wb')
        output.write(_FULLURL.read())
        output.close()

I'm getting AttributeError: 'str' object has no attribute 'read'.

OK, I know why that happens, but how can I actually download the file from the URL I generated?

Filipe Manuel
  • Why should a string have a read() method? –  Sep 27 '13 at 16:52
  • _FULLURL is obviously a string. You likely want to fetch the content at that URL and store the content, not the URL itself. So please fix your code. –  Sep 27 '13 at 17:02
  • Open the URL with `requests.get(_FULLURL)` and you'll be able to save the response's content to a file (sketched below). – TankorSmash Sep 27 '13 at 17:36
  • @user2799617 I know a string shouldn't have a read() method; I just mean that I need another way to get this URL as a URL. – Filipe Manuel Sep 27 '13 at 19:46
  • Get this URL as a URL? What do you mean? –  Sep 28 '13 at 02:50
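
A minimal sketch of what TankorSmash's comment suggests, assuming (as the original code does) that every PDF link's href is a bare filename relative to _URL:

import requests
from bs4 import BeautifulSoup as bs

_URL = 'http://www.desconversa.com.br/matematica/wp-content/uploads/2013/01/'

r = requests.get(_URL)
soup = bs(r.text)

for link in soup.findAll('a'):
    href = link.get('href')
    if href and href.endswith('.pdf'):
        # requests.get() fetches the PDF itself; .content holds the raw bytes.
        pdf = requests.get(_URL + href)
        with open(href, 'wb') as f:
            f.write(pdf.content)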

3 Answers


This will write all the PDF files from the page, with their original filenames, into a pdfs/ directory (Python 2, since it uses urllib2).

import requests
from bs4 import BeautifulSoup as bs
import urllib2


_ANO = '2013/'
_MES = '01/'
_MATERIAS = 'matematica/'
_CONTEXT = 'wp-content/uploads/' + _ANO + _MES
_URL = 'http://www.desconversa.com.br/' + _MATERIAS + _CONTEXT

r = requests.get(_URL)
soup = bs(r.text)

# Collect the full URL and original filename of every PDF link on the page.
urls = []
names = []
for link in soup.findAll('a'):
    href = link.get('href')
    _FULLURL = _URL + href
    if _FULLURL.endswith('.pdf'):
        urls.append(_FULLURL)
        names.append(href)

names_urls = zip(names, urls)

for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    # Save the response body (the raw PDF bytes) under its original filename.
    pdf = open('pdfs/' + name, 'wb')
    pdf.write(res.read())
    pdf.close()
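
Note that the pdfs/ directory has to exist before you run this; open() raises an IOError rather than creating it for you.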
samstav

It might be easier with wget, because then you have the full power of wget at your disposal (setting the user agent, following links, honoring or ignoring robots.txt, and so on) if you need it:

import os

# names and urls as collected in samstav's answer above.
names_urls = zip(names, urls)

for name, url in names_urls:
    print('Downloading %s' % url)
    os.system('wget %s' % url)
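
If the URLs might contain spaces or shell metacharacters, a safer variant (a sketch using the standard subprocess module instead of os.system) skips the shell entirely:

import subprocess

for name, url in names_urls:
    print('Downloading %s' % url)
    # Passing the command as a list avoids shell quoting problems.
    subprocess.call(['wget', url])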
Balzer82

I have adapted samstav's answer for Python 3. urllib2 is no longer needed, since requests can download the files as well.

from bs4 import BeautifulSoup
import os
import requests

baseurl = "https://www.debian.org"
path = "/releases/stable/releasenotes"
_URL = baseurl + path

r = requests.get(_URL)

soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly
urls = []
names = []
for link in soup.findAll("a"):
    href = str(link.get("href"))
    _FULLURL = baseurl + href
    if _FULLURL.endswith(".pdf"):
        urls.append(_FULLURL)
        names.append(href)

names_urls = zip(names, urls)

os.makedirs("pdfs", exist_ok=True)  # create the target directory if needed

for name, url in names_urls:
    print(url)
    r = requests.get(url)
    # r.content holds the raw PDF bytes; save them under the file's basename.
    with open("pdfs/" + name.split("/")[-1], "wb") as f:
        f.write(r.content)
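
For large PDFs it may be worth streaming the download instead of holding the whole file in memory. A sketch of that variant, replacing the final loop above:

for name, url in names_urls:
    print(url)
    # stream=True defers fetching the body; iter_content() yields it in chunks.
    with requests.get(url, stream=True) as r:
        with open("pdfs/" + name.split("/")[-1], "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)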