Scrapy convert from unicode to utf-8

Question

I've written a simple script to extract data from a site. The script works as expected, but I'm not pleased with the output format.
Here is my code:

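import urlparse  # Python 2 module

from scrapy import Spider, Request
from scrapy.loader import ItemLoader
# Article is the project's Item class; this module path is an assumption
from myproject.items import Article

class ArticleSpider(Spider):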
    name = "article"
    allowed_domains = ["example.com"]
    start_urls = (
        "http://example.com/tag/1/page/1",  # trailing comma makes this a tuple, not a plain string
    )

    def parse(self, response):
        next_selector = response.xpath('//a[@class="next"]/@href')
        url = next_selector[1].extract()
        # url is like "tag/1/page/2"
        yield Request(urlparse.urljoin("http://example.com", url))

        item_selector = response.xpath('//h3/a/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin("http://example.com", url),
                      callback=self.parse_article)

    def parse_article(self, response):
        item = ItemLoader(item=Article(), response=response)
        # here i extract title of every article
        item.add_xpath('title', '//h1[@class="title"]/text()')
        return item.load_item()

I'm not pleased with the output, something like:

[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}

I think I need to use a custom ItemLoader class, but I don't know how. I need your help.

TL;DR: I need to convert text scraped by Scrapy from unicode to utf-8.

That's just scrapy printing unicode characters (Cyrillic). How are you saving your scraped items? What are you going to do with it once you've saved it? Unicode issues usually depend on what software you're using to view the unicode data with. — Steve, Apr 29 '16 at 14:32
Later I'm going to save it to a PostgreSQL database (using a pipeline), but for now I'm running it as `scrapy crawl article -o file.json` and I see the same output in the JSON file. I have to admit that I'm new to Scrapy, so I appreciate any criticism :) — GriMel, Apr 29 '16 at 15:29
related: [Python string prints as `[u'String']`](http://stackoverflow.com/a/36891685/4279) — jfs, Apr 29 '16 at 15:33

neverlastn · Accepted Answer · 2016-04-30T19:29:41.263

As you can see below, this isn't so much a Scrapy issue as a Python one. And it is only marginally an issue at all :)

$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya

In [7]: print response.xpath('//h1/text()').extract_first()
 "ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"

In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'

What you see is two different representations of the same thing - a unicode string.
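The same effect can be reproduced without Scrapy. Below is a minimal sketch, assuming Python 3, where the built-in ascii() gives the kind of escaped form that Python 2's repr() produces for a unicode string. The sample title is a shortened version of the one above.

```python
import ast

# One string, written here with escapes so the source stays ASCII.
title = '\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"'

# Escaped form: what a log line or an Out[] cell shows under Python 2.
# ascii() mimics that behaviour in Python 3.
escaped = ascii(title)
print(escaped)

# Evaluating the escaped form gives back the identical string, so the
# escapes are only a display choice, not a different value.
round_tripped = ast.literal_eval(escaped)
print(round_tripped == title)

# print(title) would show the readable Cyrillic text on a UTF-8 terminal.
```

Nothing in the string changes between the two forms; only the way it is displayed differs.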

What I would suggest is to run crawls with -L INFO or add LOG_LEVEL='INFO' to your settings.py so that this output isn't shown in the console.

One annoying thing is that when you save as JSON, you get escaped unicode, e.g.

$ scrapy crawl example -L INFO -o a.jl

gives you:

$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}

This is correct, but it takes more space, and most applications handle non-escaped JSON equally well.

Adding a few lines in your settings.py can change this behaviour:

from scrapy.exporters import JsonLinesItemExporter
class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
    'jl': 'myproject.settings.MyJsonLinesItemExporter',
}

Essentially, all we do is set ensure_ascii=False for the default JSON Item Exporters, which prevents escaping. I wish there was an easier way to pass arguments to exporters, but I can't see one, since they are initialized with their default arguments. Anyway, now your JSON file has:

$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}

which is better-looking, equally valid and more compact.
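The effect of ensure_ascii can be checked with the standard json module alone. This is a small sketch in Python 3, on the assumption that the exporter simply forwards the option to the JSON encoder as described above. The item dict is a shortened stand-in for a scraped item.

```python
import json

# Shortened stand-in for a scraped item.
item = {"title": '\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e"'}

escaped = json.dumps(item)                        # default: ensure_ascii=True
unescaped = json.dumps(item, ensure_ascii=False)  # keeps the characters as-is

# Both lines decode to exactly the same data.
same = json.loads(escaped) == json.loads(unescaped) == item

# Size of each line once written to disk as UTF-8.
escaped_size = len(escaped.encode("utf-8"))
unescaped_size = len(unescaped.encode("utf-8"))

print(same, escaped_size, unescaped_size)
```

Each escaped Cyrillic letter takes six bytes (\uXXXX) against two bytes in UTF-8, which is where the size difference comes from.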

score 0 · Answer 2 · answered May 01 '16 at 06:31

There are 2 independent issues affecting the display of unicode strings.

If you return a list of strings, the output file will have some issues with them, because it will use the ascii codec by default to serialize the list elements. You can work around this as shown below, but it's more appropriate to use extract_first() as suggested by @neverlastn:
```
class Article(Item):
    title = Field(serializer=lambda x: u', '.join(x))
```
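Both options turn the list that extract() returns into a single string. Here is a plain-Python sketch of the two shapes, with no Scrapy involved; first_or_none and join_values are hypothetical helpers that only mimic what extract_first() and the serializer above do.

```python
def first_or_none(values):
    """Return the first extracted value, or None when nothing matched."""
    return values[0] if values else None


def join_values(values, sep=', '):
    """Collapse a list of strings into one string, like the serializer."""
    return sep.join(values)


# Hypothetical extraction result: a list of unicode strings.
extracted = ['\u0412\u041e\u041e\u0411\u0429\u0415', '\u0422\u041e']

first = first_or_none(extracted)   # a single string
joined = join_values(extracted)    # all strings, comma-separated
empty = first_or_none([])          # None when nothing was extracted
```

Taking the first value is usually the better fit for a field such as a title, where only one match is expected.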

The default implementation of the __repr__() method serializes unicode strings to their escaped version \uxxxx. You can change this behaviour by overriding this method in your item class:

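class Article(Item):
    def __repr__(self):
        # Python 2 only: the `unicode` type does not exist in Python 3
        data = self.copy()
        for k in data.keys():
            if isinstance(data[k], unicode):
                data[k] = data[k].encode('utf-8')
        # delegate to the parent __repr__ on the encoded copy
        return super(Article, data).__repr__()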
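Since the question asks about converting "from unicode to utf-8", it may help to see what encode('utf-8') actually returns. This is a minimal sketch in Python 3 (in Python 2 the two types are called unicode and str rather than str and bytes). The text is a sequence of code points, and UTF-8 is one way of turning it into bytes when it is written out. The file name is arbitrary.

```python
import io
import os
import tempfile

# Seven Cyrillic letters, written with escapes so the source stays ASCII.
text = '\u0421\u0412\u041e\u0411\u041e\u0414\u0410'

encoded = text.encode('utf-8')     # bytes: two per Cyrillic letter
decoded = encoded.decode('utf-8')  # back to the original text

# Writing the text with an explicit encoding stores those same bytes.
path = os.path.join(tempfile.mkdtemp(), 'title.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'rb') as f:
    on_disk = f.read()

print(len(text), len(encoded), decoded == text, on_disk == encoded)
# prints: 7 14 True True
```

The scraped value is already correct text; the encoding only matters at the point where it is written to a file, a database or a terminal.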
