
How can I prevent Scrapy from crawling a website endlessly when only the URL changes (in particular a session ID or something like that) and the content behind the URLs stays the same? Is there a way to detect that?

I've read Avoid Duplicate URL Crawling, Scrapy - how to identify already scraped urls and How to filter duplicate requests based on url in scrapy, but sadly that isn't enough to solve my problem.

teawithfruit

1 Answer


There are a couple of ways to do this, both related to the questions you've linked to.

With the first, you decide which URL parameters make a page unique and tell a custom duplicate request filter to ignore the other portions of the URL. This is similar to the answer at https://stackoverflow.com/a/13605919.

Example:

url: http://www.example.org/path/getArticle.do?art=42&sessionId=99&referrerArticle=88
important bits: protocol, host, path, query parameter "art"
implementation:
from urllib.parse import urlparse, urlunparse, ParseResult

def url_fingerprint(self, url):
    pr = urlparse(url)
    # Keep only the query parameters that identify the article (here "art"),
    # dropping sessionId, referrerArticle and any other noise.
    queryparts = [prt for prt in pr.query.split('&') if prt.split('=')[0] == 'art']
    return urlunparse(ParseResult(scheme=pr.scheme, netloc=pr.netloc,
                                  path=pr.path, params=pr.params,
                                  query='&'.join(queryparts),
                                  fragment=pr.fragment))
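
To plug a fingerprint like this into Scrapy's duplicate filtering, you'd typically subclass the built-in dupefilter and point the DUPEFILTER_CLASS setting at it. Here's a minimal sketch, assuming the older Scrapy API where RFPDupeFilter exposes a request_fingerprint() hook (module and class names are just illustrative):

# dupefilters.py -- minimal sketch; "myproject" and ArticleDupeFilter are illustrative names
from urllib.parse import urlparse, urlunparse, ParseResult

from scrapy.dupefilters import RFPDupeFilter

class ArticleDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Fingerprint on scheme, host, path and the "art" parameter only,
        # so URLs that differ just in sessionId etc. are treated as duplicates.
        pr = urlparse(request.url)
        queryparts = [p for p in pr.query.split('&') if p.split('=')[0] == 'art']
        return urlunparse(ParseResult(scheme=pr.scheme, netloc=pr.netloc,
                                      path=pr.path, params=pr.params,
                                      query='&'.join(queryparts),
                                      fragment=pr.fragment))

# settings.py
# DUPEFILTER_CLASS = 'myproject.dupefilters.ArticleDupeFilter'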

The other way is to determine what bit of information on the page makes it unique, and use either the IgnoreVisitedItems middleware (as per https://stackoverflow.com/a/4201553) or a dictionary/set in your spider's code. If you go the dictionary/set route, have your spider extract that bit of information from the page and check the set to see if you've seen that page before; if so, stop parsing and return.

What bit of information you'll need to extract depends on your target site. It could be the title of the article, an OpenGraph <og:url> tag, etc.
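
If you go the set route, a bare-bones sketch of that check inside a spider might look like this (the spider name, start URL and XPaths are only illustrative; adapt them to whatever identifies an article on your site):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'articles'
    start_urls = ['http://www.example.org/path/getArticle.do?art=42']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_articles = set()  # identifiers of pages already parsed

    def parse(self, response):
        # Pick something that identifies the page regardless of URL noise,
        # e.g. the og:url meta tag or the article title.
        key = (response.xpath('//meta[@property="og:url"]/@content').get()
               or response.xpath('//title/text()').get())
        if key in self.seen_articles:
            return  # same article seen under a different URL; stop parsing
        self.seen_articles.add(key)
        yield {'title': response.xpath('//title/text()').get(), 'url': response.url}
        # follow links as usual; repeats are caught by the check above
        yield from response.follow_all(xpath='//a/@href', callback=self.parse)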

PlasmaSauna