Scrapy linkextractors fail

Question

I keep running into failures with Scrapy's link extractors. For example:

scrapy shell "http://www.dachser.com/de/de/"
# within shell
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
SgmlLinkExtractor().extract_links(response)
# yields: SGMLParseError: expected name token at '<!/IoRangeRedDotMode'

I only need a list of all links, which is why I switched from SgmlLinkExtractor to the basic HtmlParserLinkExtractor. This works for the URL above, but with another URL even this fails:

scrapy shell "http://www.yourfirm.de"
# within shell
from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
HtmlParserLinkExtractor().extract_links(response)
# yields: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)

What's going on here? I plan to extract links from various websites, so a more robust way of extracting links would be very welcome.

Update: Okay, I figured out that the ascii error can be resolved on Windows by setting utf-8 as the system default encoding, see here. Now other sites fail, though. For example, scrapy shell "http://grunwald-wangen.de" causes UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 17: invalid start byte.

The second example works fine for me ! I have Scrapy 0.22.2. — agstudy, Jun 22 '14 at 12:28
I updated to `0.22.2` but it is still the same. I've read that stdout can sometimes yield encoding errors depending on the terminal, but I do not think that is the case here: for my script I redirected the output to a log with `LOG_FILE = "spider.log"` and the error still remains. Python 2.7.7, Scrapy 0.22.2, Windows 7 64-Bit. — bioslime, Jun 22 '14 at 12:48
`HtmlParserLinkExtractor` works with `http://www.google.de`, `http://www.dachser.com/de/de/` for example but fails at `http://www.yourfirm.de` or `http://www.crossvertise.com/werbung/deutschland/werbung-berlin/` for example. — bioslime, Jun 22 '14 at 13:17
That's a problem inside the used webpages. Both examples are working fine with different webpages using `Scrapy 0.22.2`. Have a look at https://groups.google.com/forum/#!topic/scrapy-users/iA1VzcJYpJE for a possible solution. — Christian Berendt, Jun 22 '14 at 18:17

score 1 · Answer 1 · answered Jun 22 '14 at 20:13

The HtmlParserLinkExtractor passes response.body, the raw byte string, to the HTMLParser.

Altering the source code so that it receives response.body_as_unicode() instead fixes the issue. The HTMLParser documentation states that passing unicode is advised. I made a pull request on GitHub.
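To illustrate the idea outside of Scrapy, here is a minimal sketch using only the Python 3 standard library (the question itself is about Python 2.7 and Scrapy 0.22, so this is not the patched Scrapy code). The names `LinkCollector` and `extract_links` are hypothetical, and the sketch assumes the caller already knows the page's encoding. The point it shows: decode the bytes to text first, then feed the text to the parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(body, encoding="utf-8"):
    """Decode the raw bytes first, then parse the resulting text.

    errors="replace" is an assumption made for robustness: undecodable
    bytes become U+FFFD instead of raising UnicodeDecodeError.
    """
    text = body.decode(encoding, errors="replace")
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # 0xfc is "ü" in latin-1 but an invalid start byte in UTF-8.
    page = b'<a href="/\xfcber-uns">About</a>'
    print(extract_links(page, encoding="latin-1"))
```

If the wrong encoding is passed, the sketch still returns the links, only with replacement characters in place of the bytes it could not decode.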

As Berendt stated in the comments, the SgmlLinkExtractor seems to choke on some malformed HTML.
