I am extracting data from a website, and one entry contains a special character: Comfort Inn And Suites�? Blazing Stump. When I try to extract it, Scrapy throws this error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
    yield it.next()
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
    for x in result:
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
    print repr(business.select('a[@class="name"]/text()').extract()[0])
  File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
    result = self.xpathev(xpath)
  File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
  File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
  File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
  File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
  File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
  File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
  File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte
After searching the web I have tried several things, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.
The source URL is "http://www.truelocal.com.au/find/hotels/97/"; the entry in question is the fourth one on that page.
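
For what it's worth, the same failure seems to be reproducible outside Scrapy with a small sketch like the one below. It assumes the raw bytes of the entry are the name followed by a lone 0xc3 byte and a literal "?" (the byte after 0xc3 is a guess based on how the name is displayed); `raw`, `find_bad_byte` and `decode_lenient` are hypothetical names used only for illustration. Note that in the traceback the exception is raised inside select(), so the decode('utf-8') and unicodedata.normalize() calls applied to the extracted result are never reached.

```python
# Assumed reconstruction of the raw bytes for the fourth entry.
# "Comfort Inn And Suites" is 22 characters long, so the stray 0xc3
# lands at position 22, matching the position in the traceback.
raw = b"Comfort Inn And Suites\xc3? Blazing Stump"


def find_bad_byte(data):
    """Return the offset of the first byte that breaks strict UTF-8
    decoding, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def decode_lenient(data):
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes
    instead of raising."""
    return data.decode("utf-8", "replace")


bad_pos = find_bad_byte(raw)   # 22 under the assumption above
cleaned = decode_lenient(raw)  # stray byte replaced by U+FFFD
```

If this guess about the bytes is right, the page itself contains a malformed UTF-8 sequence, and any workaround would have to handle the bytes before lxml tries to build the text result.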