I am new to web scraping and am using Scrapy to recursively collect all URLs under a domain. I used HtmlXPathSelector
hxs.select('//a/@href').extract()
to get URLs.
However, many of the URLs I get are very similar to each other. Is there any way to treat these URLs as one page? I got about 80,000 such distinct URLs, so I am wondering if I have done something wrong. The URLs differ only in parts like:
53HK-39000
53HK-20000
My algorithm looks like this:

for cur in url_lst:
    if cur in visited:
        continue
    yield Request(cur, callback=self.parse)
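One common approach is to canonicalize each URL before checking it against the visited set, so that URLs differing only in an ID or query string collapse to one key. Below is a minimal sketch, assuming (hypothetically, based on the examples above) that the duplicates differ only in a trailing numeric ID and the query string; the `canonicalize` name and the regex pattern are my own illustration, not Scrapy API:

```python
import re
from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    """Reduce a URL to a canonical form for deduplication.

    Assumes near-duplicate URLs differ only in a trailing
    numeric ID and/or the query string (hypothetical pattern).
    """
    parts = urlparse(url)
    # Strip a trailing run of digits from the path (e.g. 53HK-39000 -> 53HK-).
    path = re.sub(r'\d+$', '', parts.path)
    # Drop params, query string, and fragment entirely.
    return urlunparse((parts.scheme, parts.netloc, path, '', '', ''))
```

You would then store `canonicalize(cur)` in `visited` instead of the raw URL, so `.../53HK-39000` and `.../53HK-20000` map to the same key and only one of them is crawled.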