Python Scrapy Skip Named Anchor And Miss Link

Question

While switching from urllib2+bs4 to Scrapy, I noticed a few things that Scrapy 'handles smartly' in its default settings. I am not sure whether I have understood them correctly, so please correct me if I am wrong.

(1) By default, Scrapy doesn't crawl duplicate URLs, so what counts as a duplicate URL? I noticed that the URLs Scrapy crawls contain no "fragment" or "named anchor"; for example, it treats the links below as the same. I know this is logical, since they are actually the same page, but I don't know whether it is a good idea for people who might need the fragment.

www.abc.com/page1
www.abc.com/page1#top
www.abc.com/page2#bot
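
To make the behaviour described above concrete, here is a minimal sketch of fragment-insensitive de-duplication using only the standard library. It is not Scrapy's own code: the helper names `strip_fragment` and `unique_pages` are hypothetical, and the `http://` scheme is added to the example URLs as an assumption so that they parse as absolute URLs.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#fragment' part (hypothetical helper)."""
    return urldefrag(url).url


def unique_pages(urls):
    """Keep the first occurrence of each URL once fragments are ignored."""
    seen = set()
    result = []
    for url in urls:
        page = strip_fragment(url)
        if page not in seen:
            seen.add(page)
            result.append(page)
    return result


# The URLs from the question, with an assumed http:// scheme.
urls = [
    "http://www.abc.com/page1",
    "http://www.abc.com/page1#top",
    "http://www.abc.com/page2#bot",
]

pages = unique_pages(urls)
print(pages)
```

With fragments ignored, `page1` and `page1#top` collapse into one entry, while `page2#bot` survives only as `page2`.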

(2) By default, Scrapy only follows links found in a or area tags, so it will miss URLs in link tags. I am not a web developer, but there might be other tags containing URLs that the default settings do not cover.

I am not criticizing Scrapy here; I just want to make sure that these two observations are true and not my own misunderstanding, and I hope they are helpful for people who want the URLs under link tags or who need named anchors.

Thanks!

score 0 · Answer 1 · answered Dec 14 '13 at 06:01

As to your second point, about links in tags other than a or area, see this page in the Scrapy documentation. The gist is that you can choose which tags are searched for links by passing tags to SgmlLinkExtractor(), where tags is a list of strings that defaults to ('a', 'area').
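
To show what a tags setting changes in practice, here is a small standard-library sketch. It is not Scrapy's implementation and does not call SgmlLinkExtractor; the class `TagLinkCollector`, the function `collect_links`, and the sample HTML are all made up for illustration. It only mimics the idea of restricting link extraction to a configurable list of tags.

```python
from html.parser import HTMLParser


class TagLinkCollector(HTMLParser):
    """Collect attribute values (href by default) from a chosen set of tags."""

    def __init__(self, link_tags=("a", "area"), link_attrs=("href",)):
        super().__init__()
        self.link_tags = set(link_tags)
        self.link_attrs = set(link_attrs)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore any tag that is not in the configured list.
        if tag not in self.link_tags:
            return
        for name, value in attrs:
            if name in self.link_attrs and value:
                self.links.append(value)


def collect_links(html, link_tags=("a", "area")):
    """Return the links found in the given tags, in document order."""
    collector = TagLinkCollector(link_tags=link_tags)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page with one URL in a link tag and one in an a tag.
SAMPLE_HTML = """
<html><head><link rel="next" href="/page2"></head>
<body><a href="/page1">one</a></body></html>
"""

default_links = collect_links(SAMPLE_HTML)
extended_links = collect_links(SAMPLE_HTML, link_tags=("a", "area", "link"))
print(default_links)
print(extended_links)
```

With the default tags only the a link is found; adding 'link' to the list picks up the URL in the head as well. In Scrapy itself the equivalent change would be along the lines of passing tags=('a', 'area', 'link') to the link extractor, as described in the documentation referenced above.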
