
I am new to web scraping and am using Scrapy to recursively get all URLs under a domain. I used HtmlXPathSelector

hxs.select('//a/@href').extract() 

to get URLs.

However, many of the URLs I get are very similar to each other. Is there any way to consider these URLs as one page?

Example: http://infohawk.uiowa.edu:80/F/YY75HHTMTKKDNCBT7JBYQBH64VAFXIDNMS1YT4MRKSVF5A53HK-21930?func=myshelf-short&folder=BASKET&folder_key=BASKET&sort_option=04---A

http://infohawk.uiowa.edu:80/F/YY75HHTMTKKDNCBT7JBYQBH64VAFXIDNMS1YT4MRKSVF5A53HK-09565?func=myshelf-short&folder=BASKET&folder_key=BASKET&sort_option=04---A

I got about 80,000 such URLs, so I am wondering if I have done something wrong. The other URLs differ only in parts like:

53HK-39000
53HK-20000

My algorithm looks like this:

for cur in url_lst:
    if cur in visited:
        continue
    yield Request(cur, callback=self.parse)
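The only difference I can see between the duplicate links is the numeric suffix after `53HK-`, which looks like a per-session token. Would normalizing the URL before the `visited` check work, something like the sketch below? (The `canonicalize` helper and the regex are just my guess at stripping that token, and `visited` is a set kept on the spider.)

import re
from urllib.parse import urlsplit, urlunsplit

# Guess: strip the trailing "-NNNNN" session-like token from the path so
# that links differing only by that number map to the same key.
SESSION_RE = re.compile(r'-\d+$')

def canonicalize(url):
    parts = urlsplit(url)
    path = SESSION_RE.sub('', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ''))

for cur in url_lst:
    key = canonicalize(cur)
    if key in visited:
        continue
    visited.add(key)
    yield Request(cur, callback=self.parse)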
  • If you want to filter them across all parsed sites, [write a custom middleware](http://stackoverflow.com/questions/12553117/how-to-filter-duplicate-requests-based-on-url-in-scrapy) (see the sketch after these comments). – memoselyk Dec 02 '15 at 06:40
  • How are you getting your url_lst? Scrapy will not revisit URLs queued by Request(). This is controlled by the `dont_filter` parameter in `Request()`, which is False by default, i.e. don't "don't filter" == filter. Slightly unfortunate choice of negative logic; simply calling it do_filter would have been nicer imho. – Steve Dec 02 '15 at 09:27
  • do you want to get only one item from a set of pages? – eLRuLL Dec 02 '15 at 15:03
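Following memoselyk's middleware suggestion above, one way this could look is a custom dupefilter that fingerprints requests on the normalized URL, so that links differing only by the session-like suffix are filtered as duplicates. A rough sketch, not a tested implementation (the module path, class name, and regex are placeholders):

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.SessionAwareDupeFilter'

# myproject/dupefilters.py
import re
from scrapy.dupefilters import RFPDupeFilter

SESSION_RE = re.compile(r'-\d+(?=[?/]|$)')  # assumed session-token pattern

class SessionAwareDupeFilter(RFPDupeFilter):
    # Fingerprint the request on a URL with the session-like suffix removed,
    # so otherwise-identical links are treated as duplicates.
    def request_fingerprint(self, request):
        stripped = request.replace(url=SESSION_RE.sub('', request.url))
        return super(SessionAwareDupeFilter, self).request_fingerprint(stripped)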
