
I recently watched a thenewboston video on writing a web crawler in Python. For some reason, I'm getting an SSLError. I tried fixing it with the commented-out requests.get(..., verify=True) line below, but no luck. Any idea why it's throwing errors? The code is verbatim from thenewboston.

import requests
from bs4 import BeautifulSoup

def creepy_crawly(max_pages):
    page = 1
    #requests.get('https://www.thenewboston.com/', verify = True)
    while page <= max_pages:

        url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text 
        soup = BeautifulSoup(plain_text)

        for link in soup.findAll('a', {'class' : 'item-name'}):
            href = "https://www.thenewboston.com" + link.get('href')
            print(href)

        page += 1

creepy_crawly(1)
Steven
  • An SSL error is due to web certificates. It's probably happening because the URL you are trying to crawl is https. Try a different site with only http. – Craicerjack Nov 24 '14 at 19:24
  • Possible duplicate of http://stackoverflow.com/q/10667960/783219 – Prusse Nov 24 '14 at 19:46
  • Thank you Craicerjack! I tried it on a website with only "http" and it worked! But how would I go about running a web crawler on a domain with "https"? – Steven Nov 24 '14 at 20:10

1 Answer


I've written a web crawler using urllib; it can be faster and has no problem accessing https pages. One thing, though, is that it doesn't validate the server certificate, which makes it faster but more dangerous (vulnerable to MITM attacks). Below is a usage example of that lib:

import urllib

link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()  # note: no certificate validation on https
print(html)

A few lines are all you need to grab the HTML from a page; simple, isn't it?

More about urllib: https://docs.python.org/2/library/urllib.html
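
If you want to stay with the requests library from the question, a rough equivalent is to pass verify=False, which skips certificate validation just like urllib above (same MITM caveat) and should sidestep the SSLError; keeping the default verify=True, or pointing verify at a CA bundle file, is the safe option. A minimal sketch (the URL is just an example):

import requests

link = 'https://www.stackoverflow.com'
# verify=False disables certificate validation, mirroring urllib's
# behaviour above; it is faster but vulnerable to MITM attacks
html = requests.get(link, verify=False).text
print(html)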

I also recommend running a regex on the HTML to grab other links; an example (using the re library) would be:

import re
import urlparse  # urllib.parse on Python 3

orig_link = 'https://www.stackoverflow.com'  # the page the HTML was fetched from
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I):  # search the HTML for other URLs
    # Drop any #fragment; prefix relative URLs with the page's scheme and host
    link = url.split("#", 1)[0] \
        if url.startswith("http") \
        else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(orig_link)) + url.split("#", 1)[0]
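
If you'd rather not run a regex over HTML (see the comments below), a minimal sketch of the same link extraction with BeautifulSoup, which the question already uses, would be:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all('a', href=True):  # every anchor tag that has an href attribute
    print(a['href'])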
ArthurG
  • Isn't it a general rule that you shouldn't use regex to parse HTML? – Steven Dec 05 '16 at 18:00
  • Regex is considered slow in many languages, but that doesn't seem to be the case in Python; my web crawler is able to process 10 links per second. Unless you want something faster than that, regex should serve you fine. Needless to say, regex is very precise. – ArthurG Dec 06 '16 at 19:00