I am running Python 2.7 (64-bit, from Python.org) on 64-bit Windows Vista. I have a Scrapy spider that I tested on the BBC Sport website, where it seemed to work OK. I have since pointed it at Wikipedia to see whether it works on other sites. The code is below:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags

class MySpider(BaseSpider):
    name = "bbc"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Asia"]

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for titles in titles:
            body = response.xpath("//p").extract()
            body2 = "".join(body)
            body2 = unicode(body2)
            print remove_tags(body2)
I added the unicode() call because, on every Wikipedia page I have tried so far, I keep getting errors about a non-Unicode character that the Command Shell cannot display.
I'm not sure why this call is not converting the scraped text to Unicode and allowing it to be printed. Can anyone see the issue?
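For reference, here is a minimal sketch of what may be happening, independent of Scrapy. It rests on several assumptions: that the extracted text is already a Unicode string (so wrapping it in unicode() would change nothing), that the Command Shell uses code page cp437 (running chcp would show the real one), and that the page text contains a character outside that code page, such as an en dash. The sample string and the helper name to_console_bytes are hypothetical, made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: reproduces a console-encoding failure without Scrapy.
# Assumption: the console code page is cp437; yours may differ.

def to_console_bytes(text, encoding="cp437"):
    """Encode Unicode text for a console code page.

    Characters the code page cannot represent are replaced with '?'
    instead of raising UnicodeEncodeError.
    """
    return text.encode(encoding, "replace")

# Hypothetical sample text containing an en dash (U+2013),
# which cp437 has no byte for.
sample = u"Asia \u2013 Earth's largest continent"

# A strict encode, which is what printing to the console effectively
# does, fails on the en dash.
try:
    sample.encode("cp437")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with a replacement policy succeeds.
safe = to_console_bytes(sample)
```

If this is indeed the cause, the error would come from encoding the text for the console at print time rather than from the text not being Unicode, and passing the console's actual encoding (for example sys.stdout.encoding, when it is set) to the helper would be one way to print without the exception, at the cost of losing the unsupported characters.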
Thanks