This is my first Python project, so it is very basic. I often have to clean viruses off friends' computers, and the free programs I use are updated frequently. Instead of manually downloading each one, I wanted to automate the process. Since I am also trying to learn Python, I thought it would be a good opportunity to practice.

Questions:

For some of the links, I have to scrape the page to find the .exe file. I can find the correct URL, but I get an error when the script tries to download it.

Is there a way to put all of the links into a list, and then create a function that goes through the list and runs on each URL? I've Googled quite a bit and I just cannot seem to make it work. Maybe I am not thinking in the right direction?

import urllib, urllib2, re, os
from BeautifulSoup import BeautifulSoup

# Website List
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe'
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe'
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1'
tr = 'http://www.simplysup.com/tremover/download.html'
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download

match = re.compile(r'\.exe')
data = urllib2.urlopen(urllist)
page = BeautifulSoup(data)

# Check links
#def findexe():
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)

    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes")
urllib.urlretrieve(urllist2, os.path.basename(urllist2))

As you can see, I have left the function commented out as I cannot get it to work correctly.

Should I abandon the list and just download them individually? I was trying to be efficient.

Any suggestions or if you could point me in the right direction, it would be most appreciated.

MBH

3 Answers

In addition to mikez302's answer, here's a slightly more readable way to write your code:

import os
import re
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

websites = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
]

download_links = []

for url in websites:
    connection = urllib2.urlopen(url)
    soup = BeautifulSoup(connection)
    connection.close()

    for link in soup.findAll('a', href=re.compile(r'\.exe$')):
        download_links.append(link['href'])

for url in download_links:
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
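
One gotcha worth hedging against: href attributes scraped from a page are often relative paths rather than full URLs, and urlretrieve needs an absolute URL. Here is a minimal sketch (assuming Python 2 and BeautifulSoup 3, as above) that resolves each href against the page it came from using urlparse.urljoin:

import os
import re
import urllib
import urllib2
import urlparse

from BeautifulSoup import BeautifulSoup

# Example page taken from the question's list
page_url = 'http://www.simplysup.com/tremover/download.html'

connection = urllib2.urlopen(page_url)
soup = BeautifulSoup(connection)
connection.close()

for link in soup.findAll('a', href=re.compile(r'\.exe$')):
    # urljoin turns a relative href such as 'files/trfree.exe' into an
    # absolute URL based on the page it appeared on
    full_url = urlparse.urljoin(page_url, link['href'])
    urllib.urlretrieve(full_url, os.path.basename(full_url))
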
Blender
  • Thank you for the assistance. I think I see how I was missing the loops now. Unfortunately, it still isn't working for me. It is still having trouble with the URLs. I will keep troubleshooting. – MBH Nov 16 '12 at 10:06

urllib2.urlopen is a function for accessing a single URL. If you want to access multiple URLs, you should loop over the list, like this:

for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    # Check links
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)

        except KeyError:
            pass

os.chdir(r"C:\_VirusFixes")

for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))
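
One more wrinkle: the question's urllist mixes direct .exe links with HTML download pages, and there is no point running BeautifulSoup over an .exe. A small sketch of how the loop above could skip scraping for direct links (the .endswith check is my assumption about the intent):

import re
import urllib2

from BeautifulSoup import BeautifulSoup

urllist = [
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://www.simplysup.com/tremover/download.html',
]
urllist2 = []
match = re.compile(r'\.exe')

for url in urllist:
    if url.endswith('.exe'):
        # Direct download link: nothing to scrape
        urllist2.append(url)
        continue

    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    for link in page.findAll('a'):
        href = link.get('href')
        if href and re.search(match, href):
            urllist2.append(href)

# urllist2 can then be downloaded exactly as above
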
Elias Zamaria

The code above didn't work for me. In my case, it was because the pages assemble their links through a script instead of including them in the HTML. When I ran into that problem, I used the following code, which is just a scraper:

import os
import re
import urllib
import urllib2

from bs4 import BeautifulSoup

url = ''

connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection) #Everything the same up to here 
regex = r'(.+?)\.zip'     # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup)) # This finds everything before a .zip (or .exe) in the text

# The matches usually come back with a lot of undesirable text; luckily the
# file name is almost always separated from the rest by a space, which is
# why we split and keep the last piece
link = [i.split(' ')[-1] for i in link]

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download

for i in link:
    # The text we found doesn't include the .zip (or .exe in your case),
    # so we re-append it to get the download URL and the file name
    urllib.urlretrieve(i + ".zip", filename=os.path.basename(i) + ".zip")

This is not as efficient as the code in the previous answers, but it will work for almost any site.
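
A slightly sturdier variant of the same idea (a sketch, assuming the script embeds the .zip URLs as quoted strings somewhere in the page source) is to run the regex over the raw HTML and capture the whole quoted value, which avoids the space-splitting step above:

import os
import re
import urllib
import urllib2

url = ''  # The page to scrape, left blank as above

html = urllib2.urlopen(url).read()

# Capture any quoted value ending in .zip, which also catches URLs
# that only appear inside inline <script> blocks
links = re.findall(r'["\']([^"\']+?\.zip)["\']', html)

for link in links:
    urllib.urlretrieve(link, filename=os.path.basename(link))
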

seeiespi