
I recently watched a thenewboston video on writing a web crawler in Python. For some reason, I'm getting an SSLError. I tried fixing it with the commented-out requests.get(..., verify=True) line below, but no luck. Any idea why it's throwing errors? The code is verbatim from thenewboston.

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:

        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text 
        soup = BeautifulSoup(plain_text)

        for link in soup.findAll('a', {'class' : 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)

        page += 1

creepy_crawly(1)
Steven
  • An SSL error is due to web certificates. It's probably happening because the URL you are trying to crawl is https. Try a different site with only http. – Craicerjack Nov 24 '14 at 19:24
  • Possible duplicate of http://stackoverflow.com/q/10667960/783219 – Prusse Nov 24 '14 at 19:46
  • Thank you Craicerjack! I tried it on a website with only "http" and it worked! But how would I go about running a web crawler on a domain with "https"? – Steven Nov 24 '14 at 20:10

1 Answer


I've written a web crawler using urllib; it can be faster and has no problem accessing https pages. One thing, though, is that it doesn't validate the server certificate, which makes it faster but more dangerous (vulnerable to MITM attacks). Below is a usage example of that lib:

import urllib

link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()  # note: no certificate validation on https
print(html)

A few lines are all you need to grab the HTML from a page; simple, isn't it?

More about urllib: https://docs.python.org/2/library/urllib.html
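
If you want to stay with the requests library from the question, a rough equivalent is to pass verify=False, which skips certificate validation just like urllib above (same MITM caveat) and should sidestep the SSLError; keeping the default verify=True, or pointing verify at a CA bundle file, is the safe option. A minimal sketch (the URL is just an example):

import requests

link = 'https://www.stackoverflow.com'
# verify=False disables certificate validation, mirroring urllib's
# behaviour above; it is faster but vulnerable to MITM attacks
html = requests.get(link, verify=False).text
print(html)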

I also recommend running a regex on the HTML to grab other links; an example (using the re library) would be:

import re
import urlparse  # urllib.parse on Python 3

orig_link = 'https://www.stackoverflow.com'  # the page the HTML was fetched from
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # search the HTML for other URLs
    # Drop any #fragment; prefix relative URLs with the page's scheme and host
    link = url.split("#", 1)[0] \
        if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(orig_link)) + url.split("#", 1)[0]
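
If you'd rather not run a regex over HTML (see the comments below), a minimal sketch of the same link extraction with BeautifulSoup, which the question already uses, would be:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all('a', href=True):  # every anchor tag that has an href attribute
    print(a['href'])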
ArthurG
  • Isn't it a general rule that you shouldn't use regex to parse HTML? – Steven Dec 05 '16 at 18:00
  • Regex is considered slow in many languages, but that doesn't seem to be the case in Python; my web crawler is able to process 10 links per second. Unless you want something faster than that, regex should serve you fine. Needless to say, regex is very precise. – ArthurG Dec 06 '16 at 19:00