
I'm working on a web crawler that will crawl only internal links, using requests and bs4.

I have a rough working version below, but I'm not sure how to properly check whether a link has been crawled previously or not.

import re
import time
import requests
import argparse
from bs4 import BeautifulSoup


internal_links = set()

def crawler(new_link):
    html = requests.get(new_link).text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        if "href" in link.attrs:
            print(link)
            if link.attrs["href"] not in internal_links:
                new_link = link.attrs["href"]
                print(new_link)
                internal_links.add(new_link)
                print("All links found so far, ", internal_links)
                time.sleep(6)
                crawler(new_link)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('url', help='Pass the website url you wish to crawl')
    args = parser.parse_args()

    url = args.url

    # Check a full url has been passed, otherwise requests will throw an error later

    try:
        crawler(url)

    except:
        if url[0:4] != 'http':
            print('Please try again and pass the full url eg http://example.com')



if __name__ == '__main__':
    main()

These are the last few lines of the output:

All links found so far,  {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
<a href="http://quotes.toscrape.com/search.aspx">ViewState</a>
http://quotes.toscrape.com/search.aspx
All links found so far,  {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}
<a href="http://quotes.toscrape.com/random">Random</a>
http://quotes.toscrape.com/random
All links found so far,  {'http://quotes.toscrape.com/tableful', 'http://quotes.toscrape.com', 'http://quotes.toscrape.com/js', 'http://quotes.toscrape.com/search.aspx', 'http://quotes.toscrape.com/scroll', 'http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/login', 'http://books.toscrape.com', 'http://quotes.toscrape.com/'}

So it is working, but only up to a certain point, after which it doesn't seem to follow the links any further.

I'm sure it's because of this line:

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):

as that will only find links that start with http://, and on a lot of the internal pages the links don't start with that. But when I try it like this:

for link in soup.find_all('a'):

the program runs very briefly and then ends:

http://books.toscrape.com
{'href': 'http://books.toscrape.com'}
http://books.toscrape.com
All links found so far,  {'http://books.toscrape.com'}
index.html
{'href': 'index.html'}
index.html
All links found so far,  {'index.html', 'http://books.toscrape.com'}
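
I suspect what happens here is that the relative href index.html gets passed straight to requests.get, which raises requests.exceptions.MissingSchema, and the bare except in main() then swallows it, so the program just ends. A minimal sketch (assuming the current page's URL is available, here under the made-up name page_url) of resolving relative hrefs with urllib.parse.urljoin before requesting them:

from urllib.parse import urljoin

page_url = "http://books.toscrape.com"  # hypothetical current page
href = "index.html"                     # relative href as found on that page

# urljoin resolves a relative href against the page it was found on;
# absolute hrefs pass through unchanged.
absolute = urljoin(page_url, href)
print(absolute)  # http://books.toscrape.com/index.html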

1 Answer

You could reduce

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
        if "href" in link.attrs:
            print(link)
            if link.attrs["href"] not in internal_links:
                new_link = link.attrs["href"]
                print(new_link)
                internal_links.add(new_link)

To

links = {link['href'] for link in soup.select("a[href^='http:']")}
internal_links.update(links)  

This grabs only the qualifying a tag elements with the http protocol and uses a set comprehension to ensure no dupes. It then updates the existing set with any new links. I don't know enough Python to comment on the efficiency of using .update, but I believe it modifies the existing set rather than creating a new one. More methods for combining sets are listed here: How to join two sets in one line without using "|"
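
To see the reduction in action, here is a minimal self-contained sketch (the HTML snippet is made up purely for illustration):

from bs4 import BeautifulSoup

html = """
<a href="http://quotes.toscrape.com/login">Login</a>
<a href="http://quotes.toscrape.com/login">Login again</a>
<a href="/about">About</a>
"""

internal_links = set()
soup = BeautifulSoup(html, "html.parser")

# The set comprehension dedupes within the page; update() adds the new
# links to the existing set in place rather than building a new one.
links = {link['href'] for link in soup.select("a[href^='http:']")}
internal_links.update(links)

print(internal_links)  # {'http://quotes.toscrape.com/login'}

Note that the relative /about is not matched by the selector, which is the limitation discussed in the comments below.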

  • Thanks, this partially works, but it still has the problem of not finding links without the http prefix, e.g. (about), and thus it's just crawling the top-level pages and not going any deeper. – goblin_rocket Mar 21 '19 at 00:07
  • Do the relative paths always start with /author? – QHarr Mar 21 '19 at 07:46
  • Can you supply an example url? You would simply add the selector for the relative href urls to the existing selector using CSS Or syntax: soup.select("a[href^='http:'], a[href^='/']"). Not sure if you need to escape that last / with //. – QHarr Mar 21 '19 at 07:54
  • I was testing against http://toscrape.com, but that site doesn't use full paths for internal links, which is why it wasn't finding them all. I've tried it with a site with full url paths and now it works. Ideally this would work for both situations, but this will do for now, thanks. I will post my finished code soon and mark the answer. – goblin_rocket Mar 23 '19 at 01:15
  • You can add in relative path starts, but you need to use CSS Or syntax and escape the leading /, which, thinking about it, might actually require \/ rather than //. I can use the link to update if you want. – QHarr Mar 23 '19 at 04:44
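
Putting the comment thread together, here is a sketch of how the combined selector and urllib.parse.urljoin might be wired up (collect_links is a hypothetical helper name; with the attribute value quoted, the leading / in the selector needs no escaping):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def collect_links(page_url, seen):
    # Grab absolute http links and root-relative links in one pass,
    # using the CSS Or (comma) syntax from the comments above.
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    found = {link['href'] for link in soup.select("a[href^='http:'], a[href^='/']")}
    # Resolve relative hrefs against the current page, then drop
    # anything already crawled.
    return {urljoin(page_url, href) for href in found} - seen

Hrefs like index.html, with no leading slash, would still need a broader selector (or plain find_all('a') plus filtering), as the thread notes.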