I am running 64-bit Python 2.7 (the Python.org build) on 64-bit Windows Vista. I have a Scrapy spider that I was testing on the BBC Sport website, where it seemed to work OK. I have since switched to Wikipedia to see whether it works on other sites. The code is below:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags

class MySpider(BaseSpider):
    name = "bbc"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Asia"]

    def parse(self, response):
        titles = response.selector.xpath("normalize-space(//title)")
        for titles in titles:
            body = response.xpath("//p").extract()
            body2 = "".join(body)
            body2 = unicode(body2)
            print remove_tags(body2)

I added the unicode() call because, on every Wikipedia page I have tried so far, I keep getting errors about a character that the Command Shell cannot display.

I'm not sure why this call does not convert the scraped text to Unicode so that it can be printed. Can anyone see the issue here?

Thanks

gdogg371
  • It looks like Scrapy already returns Unicode, and the `join` shouldn't change that. So `body2 = unicode(body2)` does nothing. You probably have your terminal's encoding set to something that can't handle non-ASCII characters - what happens if you encode your content explicitly? `print encode(remove_tags(body2), 'utf-8')`? – Peter DeGlopper Jul 05 '14 at 23:34
  • @PeterDeGlopper Hi, thanks for replying. When I use your suggestion above, I get the following error printed in the Command Shell: 'exceptions.NameError: global name 'encode' is not defined' – gdogg371 Jul 05 '14 at 23:38
  • Ah right, should be `remove_tags(body2).encode('utf-8')`. – Peter DeGlopper Jul 06 '14 at 00:59
  • What's going on here is that `print` on a Unicode object tries to encode it using stdout's encoding, as described here: http://stackoverflow.com/a/2597260/2337736 In this case, that's an encoding that can't handle certain characters. Encoding to UTF-8 yourself avoids that problem. There's still no guarantee that your window can display UTF-8 correctly, but Windows has come a long way in its Unicode support. – Peter DeGlopper Jul 06 '14 at 01:08
  • @PeterDeGlopper OK, thanks for that. Displaying in the Command Shell is just a temporary step while I am getting my eye in with Scrapy. My ultimate aim is to run this code via the Python IDLE. I am struggling with this at the minute though, as this thread shows: http://stackoverflow.com/questions/24591770/python-shell-not-running-scrapy?noredirect=1#comment38098837_24591770 If you have time, could you see if you know the answer to this one for me please? It should be easy for someone more experienced. Thanks. – gdogg371 Jul 06 '14 at 01:10
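
The encoding behaviour described in the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: `encode_for_console` is a hypothetical helper (not part of Scrapy or the standard library), `cp437` is assumed as a typical legacy Windows console code page (the asker's actual console encoding is not given), and the sample string is made up. It shows that a strict encode to such a code page fails on characters it lacks, while UTF-8 or a `replace` error handler does not.

```python
import sys


def encode_for_console(text, encoding=None, errors="replace"):
    """Encode Unicode text to bytes that a console can accept.

    Hypothetical helper: `encoding` defaults to sys.stdout.encoding and
    falls back to UTF-8 when stdout reports no encoding (e.g. when piped).
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors)


# Made-up sample text; U+0100 and U+0101 do not exist in cp437.
sample = u"Asia (\u0100siy\u0101)"

# A strict encode to the legacy code page raises, which is the kind of
# failure print runs into when stdout uses such an encoding.
try:
    sample.encode("cp437")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = sample.encode("utf-8")

# With errors="replace", characters the code page lacks become "?".
safe_bytes = encode_for_console(sample, "cp437")
```

Under Python 2.7, as used in the question, the result could then be printed with something like `print encode_for_console(remove_tags(body2))`. Whether the replaced characters are acceptable, or UTF-8 output is preferable, depends on what the console can actually display.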
