
I'd like to crawl through Tesla's list of Superchargers and open each individual page to record the number of connectors and charging rates. This is one of my first programs, so I'm sure I'm doing a few things wrong, but I can't get past an HTTP Error 403 when I use urlopen to open multiple URLs. Any help would be greatly appreciated!

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import csv


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.tesla.com/findus/list/superchargers/United%20States'
html = urlopen(url, context=ctx).read()
soup_main = BeautifulSoup(html, "html.parser")

data = []
# Follow each Supercharger detail link found on the list page
for tag in soup_main('a'):
    if '/findus/location/supercharger/' in tag.get('href', ''):  # '' avoids a TypeError for tags without an href
        url_sc = 'https://www.tesla.com' + tag['href']
        html_sc = urlopen(url_sc, context=ctx).read()
        soup_sc = BeautifulSoup(html_sc, "html.parser")
        address = soup_sc.find('span', class_='street-address').string
        city = soup_sc.find('span', class_='locality').string[:-5]
        state = soup_sc.find('span', class_='locality').string[-3:]
        details = soup_sc.find_all('p')[1].contents[-1]
        data.append([address, city, state, details])

header = ['Address', 'City', 'State', 'Details']
with open('datapull.csv', 'w') as fp:
    writer = csv.writer(fp, delimiter=',')
    writer.writerow(header)
    for row in data:
        writer.writerow(row)
  • Many websites forbid you from making any requests if you're not using a web browser. The easiest way to get around that is to use Chrome Headless through Selenium (see the sketch after these comments). – Boris Verkhovskiy Nov 24 '20 at 21:46
  • HTTP 403 is a "Forbidden" error, meaning you are not authorized to access the endpoint even though the server understood the request. – picmate 涅 Nov 24 '20 at 21:46
  • It looks like this particular website requires you to pass in a specific cookie. You can get the website to respond to your Python request by first going to the page in the browser and looking at the Network tab. You can grab the ```ak_bmsc``` value from the ```cookie``` header and then add that header to your Python request (a second sketch after these comments shows this). This worked for me! – sarartur Nov 24 '20 at 21:56
  • @daktoad Do you have a link to an example I can reference? I'm not familiar and trying to find more information. Thanks! – Mostapasta Nov 24 '20 at 23:11
  • @Mostapasta You can add the headers to your request as shown in the examples of the ```urllib``` documentation: https://docs.python.org/3/library/urllib.request.html. To find the cookie value in your browser you can refer to this article: https://developers.google.com/web/tools/chrome-devtools/network – sarartur Nov 24 '20 at 23:31
  • @Mostapasta I actually just played around with this problem a bit more, and it does not look like it's necessarily that cookie that you need. I would just play around with different headers and see if you can get it to work. – sarartur Nov 24 '20 at 23:49
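Here is a minimal sketch of the headless-Chrome route mentioned in the first comment, assuming Selenium and a matching ChromeDriver are installed; treat the details as illustrative rather than a drop-in fix:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# Run Chrome without opening a visible window
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.tesla.com/findus/list/superchargers/United%20States')
    # Hand the rendered HTML to BeautifulSoup, just like the html variable in the question
    soup_main = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()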
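And a minimal sketch of the cookie suggestion, assuming you have copied an ```ak_bmsc``` value out of your own browser's Network tab; the placeholder value below is not real, and as the later comment notes, the exact header needed may vary:

from urllib.request import Request, urlopen

url = 'https://www.tesla.com/findus/list/superchargers/United%20States'

# Replace the placeholder with the ak_bmsc value copied from your browser's request headers
headers = {'Cookie': 'ak_bmsc=<value copied from your browser>'}

request = Request(url, headers=headers)
html = urlopen(request).read()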

1 Answer


Try adding headers to fake a browser:

import urllib.request

# Send a User-Agent header so the request looks like it comes from a browser
url_sc = 'https://www.journaldev.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17',
}

request = urllib.request.Request(url_sc, headers=headers)
resp = urllib.request.urlopen(request)

source https://www.journaldev.com/20795/python-urllib-python-3-urllib#python-urllib-request-with-header
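Applied to the code in the question, the same idea looks roughly like the sketch below; the ```fetch_html``` helper is illustrative, not part of the original answer, and the User-Agent string is just an example browser string:

from urllib.request import Request, urlopen
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Example browser User-Agent string; any reasonably current one should do
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 '
                         '(KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}

def fetch_html(url):
    """Fetch a page with the browser-like User-Agent header attached."""
    request = Request(url, headers=HEADERS)
    return urlopen(request, context=ctx).read()

# In the question's loop, each urlopen(..., context=ctx).read() call becomes:
# html_sc = fetch_html(url_sc)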

  • I tried using your example user-agent information and my own, and received the following error with both: "urllib.error.URLError: " – Mostapasta Nov 24 '20 at 23:10
  • @Mostapasta I edited it and "it works on my machine". Are you hosting your script on a PythonAnywhere free account? – gabrielesilinic Nov 25 '20 at 11:56
  • @Mostapasta Make sure that you have the https:// and the rest of the URL right, and note that urlopen doesn't handle redirects; also try [this](https://stackoverflow.com/questions/35569042/ssl-certificate-verify-failed-with-python3). – gabrielesilinic Nov 25 '20 at 12:06
  • I tried again and was able to get it to work. I don't know what the issue was the first time. Thank you for the help!! – Mostapasta Nov 25 '20 at 18:30
  • @Mostapasta No problem. I tried downloading HTML one random day when I was bored, to see if I could find any protections and circumvent them (but I hadn't done it in a long time, so my memory was a little rusty on those things). If mine was the definitive answer, please mark it as the solution. – gabrielesilinic Nov 27 '20 at 11:45