I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping URLs that have already been scraped? Also, is there any clear documentation or are there examples of SgmlLinkExtractor?


- To do this, you'll have to store what URLs you've scraped. Are you doing that? If so, how? – Dominic Rodger Oct 06 '10 at 10:43
5 Answers
You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
To use it, copy the code from the link into a file in your Scrapy project, then enable it by adding a line to your settings.py:
SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
The specifics of why you pick that particular number can be read up on here (the snippet is a spider middleware, so the ordering rules are in the spider middleware docs): http://doc.scrapy.org/en/latest/topics/spider-middleware.html
Finally, you'll need to modify your items.py so that each item class has the following fields:
visit_id = Field()
visit_status = Field()
And I think that's it. The next time you run your spider it should automatically try to start avoiding the same sites.
Good luck!
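In case that link ever goes away: below is a minimal sketch of what such an ignore-visited-items spider middleware can look like. It is not the exact snipplr code; it keeps the visited set in memory only, so for a daily crawl you would persist it (file, database, spider state) between runs. The class name matches the one referenced in the settings.py line above, and the FILTER_VISITED flag is the one discussed in another answer below.

import hashlib

from scrapy import Request
from scrapy.item import Item


class IgnoreVisitedItems(object):
    """Sketch of a spider middleware that drops requests whose URL
    already produced an item on an earlier pass."""

    def __init__(self):
        # In a real setup, load/save this set to disk so it survives
        # between daily runs; here it lives in memory only.
        self.visited = set()

    def process_spider_output(self, response, result, spider):
        for entry in result:
            if isinstance(entry, Request):
                visit_id = hashlib.sha1(entry.url.encode('utf-8')).hexdigest()
                if entry.meta.get('FILTER_VISITED') and visit_id in self.visited:
                    continue  # already scraped on an earlier run, skip it
                entry.meta['visit_id'] = visit_id
                yield entry
            elif isinstance(entry, Item):
                visit_id = response.meta.get('visit_id')
                if visit_id:
                    self.visited.add(visit_id)
                    entry['visit_id'] = visit_id
                    entry['visit_status'] = 'new'
                yield entry
            else:
                yield entry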

- I did everything as you mentioned, but that didn't help. It still crawls the same URL. – Dec 26 '13 at 06:34
- The link mentioned is now here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/ – vrleboss Jan 20 '16 at 15:00
- Well, I followed these steps; it assigns a visit_id and a visit_status of "new", but it scans the same items again and again in every run (and assigns the same visit_id and visit_status of "new"). Any clues? – Anshu Feb 06 '16 at 05:06
- The code snippet is dated 2010, so you are probably using a much newer version of Scrapy than it was written for. – Mehmet Kurtipek Jun 16 '19 at 12:04
Scrapy can auto-filter URLs that have already been scraped, can't it? But some different URLs that point to the same page will not be filtered, such as "www.xxx.com/home/" and "www.xxx.com/home/index.html".
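That built-in filtering only applies within a single run (unless you persist the crawl state), and it works on request fingerprints derived from the URL, which is why the two URLs above are not collapsed. A small illustrative sketch (the www.xxx.com URLs are just the placeholders from above):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

r1 = Request('http://www.xxx.com/home/')
r2 = Request('http://www.xxx.com/home/index.html')

# Fingerprints are derived from the method, URL and body, so these two
# requests are considered different even if the server returns the same
# page for both of them.
print(request_fingerprint(r1) == request_fingerprint(r2))  # prints False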

This is straightforward. Maintain all your previously crawled URLs in a Python dict. Then, when you encounter a URL the next time, check whether it is already in the dict; if not, crawl it.
def load_urls(prev_urls):
    # Build a lookup of the URLs crawled in previous runs.
    prev = dict()
    for url in prev_urls:
        prev[url] = True
    return prev

def fresh_crawl(prev_urls, new_urls):
    # Only crawl URLs that have not been seen before.
    for url in new_urls:
        if url not in prev_urls:
            crawl(url)  # your own crawl function
    return

def main():
    purls = load_urls(prev_urls)   # prev_urls: URLs stored from earlier runs
    fresh_crawl(purls, nurls)      # nurls: URLs discovered in this run
    return
The above code was typed in the SO text editor (i.e. the browser), so it might have syntax errors and you might need to make a few changes, but the logic is there...
NOTE: Beware that some websites constantly change their content, so sometimes you might have to recrawl a particular webpage (i.e. the same URL) just to get the updated content.
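For a daily job you also need those previously crawled URLs to survive between runs. A minimal sketch, assuming a plain text file with one URL per line (the file name is just an example):

import os

SEEN_FILE = 'seen_urls.txt'  # example path, adjust to your project

def load_seen(path=SEEN_FILE):
    # Read the URLs recorded by previous runs, if the file exists yet.
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def save_seen(seen, path=SEEN_FILE):
    # Write the updated set back so tomorrow's run can skip these URLs.
    with open(path, 'w') as f:
        for url in sorted(seen):
            f.write(url + '\n')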

I think jama22's answer is a little incomplete.
In the snippet's check if self.FILTER_VISITED in x.meta:, you can see that FILTER_VISITED has to be present in your Request's meta for that request to be ignored. This is to ensure that you can differentiate between links that you want to traverse and move around, and item links that, well, you don't want to see again.
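In other words, you have to set that flag yourself on the requests you want de-duplicated across runs. A hedged illustration, assuming the snippet's middleware is enabled as in the accepted answer (parse_item, parse_listing and the two URL variables are placeholders):

from scrapy import Request

# Inside your spider's parse() callback:
def parse(self, response):
    # Placeholders: however you extract links on your site.
    item_url = response.urljoin('some-article.html')
    listing_url = response.urljoin('page-2.html')

    # Item pages: flag them so the middleware skips them once visited.
    yield Request(item_url, callback=self.parse_item,
                  meta={'FILTER_VISITED': True})

    # Listing/navigation pages: no flag, so they are always followed again.
    yield Request(listing_url, callback=self.parse_listing)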

As of today (2019), this post is the best answer to this problem:
https://blog.scrapinghub.com/2016/07/20/scrapy-tips-from-the-pros-july-2016
It's a library that handles the middleware part automatically.
Hope it helps someone. I've spent a lot of time searching for this.
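Assuming the post is talking about the scrapy-deltafetch spider middleware (which the "Scrapy Tips from the Pros" series covers), enabling it is roughly a settings.py change like this, per the plugin's README:

# settings.py: typical scrapy-deltafetch setup (pip install scrapy-deltafetch)
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True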
