I keep running into failures with Scrapy's link extractors. For example:
scrapy shell "http://www.dachser.com/de/de/"
# within shell
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
SgmlLinkExtractor().extract_links(response)
# yields: SGMLParseError: expected name token at '<!/IoRangeRedDotMode'
Now, I only need a list of all links, which is why I switched from SgmlLinkExtractor to the basic HtmlParserLinkExtractor. This works for the URL above, but with another URL even this fails:
scrapy shell "http://www.yourfirm.de"
# within shell
from scrapy.contrib.linkextractors.htmlparser import HtmlParserLinkExtractor
HtmlParserLinkExtractor().extract_links(response)
# yields: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
What's going on here? I plan to extract links from a variety of websites, so a more foolproof way of extracting links would be very welcome.
Update: Okay, I figured out that the ASCII error can be resolved on Windows by setting utf-8 as the system default encoding, see here. Now other sites fail, though. For example, scrapy shell "http://grunwald-wangen.de" causes UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 17: invalid start byte.
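As far as I can tell, both decode errors can be reproduced outside of Scrapy. The following is only a minimal sketch, assuming the pages contain non-ASCII characters such as German umlauts. The sample string and the helper try_decode are made up for illustration and are not part of Scrapy:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; the real pages contain other strings.
SAMPLE = u"Gr\u00fcnwald"


def try_decode(raw, encoding):
    """Return (True, text) on success, or (False, reason) if decoding fails."""
    try:
        return True, raw.decode(encoding)
    except UnicodeDecodeError as exc:
        bad_byte = bytearray(raw)[exc.start]
        return False, "byte 0x%02x at position %d: %s" % (
            bad_byte, exc.start, exc.reason)


# The same text encoded two different ways.
utf8_bytes = SAMPLE.encode("utf-8")      # the umlaut becomes the bytes 0xc3 0xbc
latin1_bytes = SAMPLE.encode("latin-1")  # the umlaut becomes the single byte 0xfc

# UTF-8 bytes decoded as ASCII: same kind of failure as on yourfirm.de.
print(try_decode(utf8_bytes, "ascii"))

# Latin-1 bytes decoded as UTF-8: same kind of failure as on grunwald-wangen.de.
print(try_decode(latin1_bytes, "utf-8"))

# Decoding with the encoding the bytes were actually written in succeeds.
print(try_decode(latin1_bytes, "latin-1"))
```

If this reading is right, no single default encoding can work for every site: the first page looks like UTF-8 being decoded as ASCII, and the last one looks like a Latin-1 style page being decoded as UTF-8.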