
I am new to web scraping and am using Scrapy to recursively get all URLs under a domain. I used HtmlXPathSelector

hxs.select('//a/@href').extract() 

to get URLs.

However, many of the URLs I get are very similar to each other. Is there any way to consider these URLs as one page?

Example: http://infohawk.uiowa.edu:80/F/YY75HHTMTKKDNCBT7JBYQBH64VAFXIDNMS1YT4MRKSVF5A53HK-21930?func=myshelf-short&folder=BASKET&folder_key=BASKET&sort_option=04---A

http://infohawk.uiowa.edu:80/F/YY75HHTMTKKDNCBT7JBYQBH64VAFXIDNMS1YT4MRKSVF5A53HK-09565?func=myshelf-short&folder=BASKET&folder_key=BASKET&sort_option=04---A

I got about 80,000 such URLs, so I am wondering if I have done something wrong. The other URLs differ only in parts like:

53HK-39000
53HK-20000

My algorithm looks like this:

for cur in url_lst:
    if cur in visited:
        continue
    yield Request(cur, callback=self.parse)
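The only difference I can see between the duplicate links is the numeric suffix after `53HK-`, which looks like a per-session token. Would normalizing the URL before the `visited` check work, something like the sketch below? (The `canonicalize` helper and the regex are just my guess at stripping that token, and `visited` is a set kept on the spider.)

import re
from urllib.parse import urlsplit, urlunsplit

# Guess: strip the trailing "-NNNNN" session-like token from the path so
# that links differing only by that number map to the same key.
SESSION_RE = re.compile(r'-\d+$')

def canonicalize(url):
    parts = urlsplit(url)
    path = SESSION_RE.sub('', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ''))

for cur in url_lst:
    key = canonicalize(cur)
    if key in visited:
        continue
    visited.add(key)
    yield Request(cur, callback=self.parse)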
  • If you want to filter them across all parsed sites, [write a custom middleware](http://stackoverflow.com/questions/12553117/how-to-filter-duplicate-requests-based-on-url-in-scrapy) (see the sketch after these comments). – memoselyk Dec 02 '15 at 06:40
  • How are you getting your url_lst? Scrapy will not revisit URLs queued by Request(). This is controlled by the `dont_filter` parameter in `Request()`, which is False by default, i.e. don't "don't filter" == filter. Slightly unfortunate choice of negative logic; simply calling it do_filter would have been nicer imho. – Steve Dec 02 '15 at 09:27
  • do you want to get only one item from a set of pages? – eLRuLL Dec 02 '15 at 15:03
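Following memoselyk's middleware suggestion above, one way this could look is a custom dupefilter that fingerprints requests on the normalized URL, so that links differing only by the session-like suffix are filtered as duplicates. A rough sketch, not a tested implementation (the module path, class name, and regex are placeholders):

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.SessionAwareDupeFilter'

# myproject/dupefilters.py
import re
from scrapy.dupefilters import RFPDupeFilter

SESSION_RE = re.compile(r'-\d+(?=[?/]|$)')  # assumed session-token pattern

class SessionAwareDupeFilter(RFPDupeFilter):
    # Fingerprint the request on a URL with the session-like suffix removed,
    # so otherwise-identical links are treated as duplicates.
    def request_fingerprint(self, request):
        stripped = request.replace(url=SESSION_RE.sub('', request.url))
        return super(SessionAwareDupeFilter, self).request_fingerprint(stripped)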
