I'm trying to use BeautifulSoup to scrape .xls tables which are available for download from Xcel Energy's website (https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports).

This function gets the URL links of the tables and attempts to download them:

url = 'https://www.xcelenergy.com/working_with_us/municipalities/community_energy_reports'
dir = 'C:/Users/aobrien/PycharmProjects/xceldatascraper/'
def scraper(page):
    from bs4 import BeautifulSoup as bs
    import urllib.request
    import requests
    import os
    import re
    tld = r'https://www.xcelenergy.com'
    pageobj = requests.get(page, verify=False)
    sp = bs(pageobj.content, 'html.parser')
    xlst, fnms = [], []
    links = [a['href'] for a in sp.find_all('a', attrs={'href': re.compile("/staticfiles/")})]
    for idx, a in enumerate(links):
        if a.endswith('.xls'):
            furl = tld + str(a)
            xlst.append(furl)
            fnms.append(a.split('/')[4])
    naur = zip(fnms, xlst)
    if not os.path.exists(dir + 'tables'):
        os.makedirs(dir + 'tables')
    for name, url in naur:
        print(url)
        res = urllib.request.urlopen(url)
        xls = open(dir + 'tables/' + name, 'wb')
        xls.write(res.read())
        xls.close()
scraper(url)

The script fails when urllib.request.urlopen(url) attempts to access the file, raising "urllib.error.HTTPError: HTTP Error 404: Not Found". The "print(url)" statement prints the URL that the script constructed (https://www.xcelenergy.com/staticfiles/xe-responsive/Working With Us/MI-City-Forest-Lake-2016.xls), and manually pasting that URL into a browser downloads the file just fine.

What am I missing?

  • The URLs you extract all have spaces in them. Try normalising the URLs before you download from them; see https://stackoverflow.com/questions/120951/how-can-i-normalize-a-url-in-python – Dan-Dev Mar 01 '18 at 21:57
  • I figured that might have been the issue, but couldn't find out how to fix it. Thanks for the link! – deutschnozzle Mar 01 '18 at 22:18
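
For anyone landing here later, this is a minimal sketch of the normalisation suggested in the comment. It assumes the 404 really is caused by the unencoded spaces in the path (a browser encodes them silently, while urllib.request.urlopen does not); it has not been checked against the live site. The helper name encode_url is made up for illustration and is not part of the original script.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url(url):
    """Percent-encode the path of a URL, leaving scheme and host untouched.

    Hypothetical helper, not from the original script.
    """
    parts = urlsplit(url)
    # '/' stays as the path separator; '%' is kept safe so a path that is
    # already percent-encoded is not encoded a second time.
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=safe_path))


raw = ("https://www.xcelenergy.com/staticfiles/xe-responsive/"
       "Working With Us/MI-City-Forest-Lake-2016.xls")
fixed = encode_url(raw)
print(fixed)
```

In the question's script, this would mean building furl as encode_url(tld + a) before appending it to xlst, so that urlopen receives "Working%20With%20Us" instead of the raw spaces.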
