
When I parse the file

<html>
    <head><meta charset="UTF-8"></head>
    <body><a href="Düsseldorf.html">Düsseldorf</a></body>
</html>

using

item = SimpleItem()
item['name'] = response.xpath('//a/text()')[0].extract()
item["url"] = response.xpath('//a/@href')[0].extract()
return item

I end up with either \u escapes

[{
    "name": "D\u00fcsseldorf",
    "url": "D\u00fcsseldorf.html"
}]

or with percent-encoded strings

D%C3%BCsseldorf
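In Python 3, the two effects come from different layers: the `\u` escapes are produced by `json`'s default `ensure_ascii=True`, while the percent-encoding is part of the URL itself and can be undone with `urllib.parse.unquote`. A minimal illustration:

```python
import json
from urllib.parse import unquote

# The \u escapes come from json's default ensure_ascii=True:
print(json.dumps({"name": "Düsseldorf"}))
# {"name": "D\u00fcsseldorf"}

# ensure_ascii=False keeps the characters as-is:
print(json.dumps({"name": "Düsseldorf"}, ensure_ascii=False))
# {"name": "Düsseldorf"}

# The percent-encoding belongs to the URL and is removed with unquote:
print(unquote("D%C3%BCsseldorf"))
# Düsseldorf
```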

The item exporter described here

# -*- coding: utf-8 -*-
import json
from scrapy.contrib.exporter import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

along with the appropriate feed exporter setting

FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter',
}

do not help.

How do I get a utf-8-encoded JSON output?

I'm reiterating/expanding an unanswered question.

Update:

Orthogonal to Scrapy, note that without setting

export PYTHONIOENCODING="utf_8"

running

> echo { \"name\": \"Düsseldorf\", \"url\": \"Düsseldorf.html\" } > dorf.json
> python -c'import fileinput, json;print json.dumps(json.loads("".join(fileinput.input())),sort_keys=True, indent=4, ensure_ascii=False)' dorf.json > dorf_pp.json

will fail with

Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
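An alternative to setting `PYTHONIOENCODING` is to bypass stdout entirely and write to a file handle opened with an explicit encoding; a Python 3 sketch (the filename `dorf_pp.json` mirrors the shell example above):

```python
import io
import json

data = [{"name": "Düsseldorf", "url": "Düsseldorf.html"}]

# Opening the output file with an explicit encoding sidesteps the
# locale-dependent default that triggers the UnicodeEncodeError above.
with io.open("dorf_pp.json", "w", encoding="utf-8") as f:
    json.dump(data, f, sort_keys=True, indent=4, ensure_ascii=False)
```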

Update:

As posted, my question was unanswerable. The UnicodeJsonLinesItemExporter works, but another part of the pipeline was the culprit: as a post-process to pretty-print the JSON output, I was using python -m json.tool in.json > out.json.
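For anyone hitting the same trap: `json.tool` pretty-prints with `ensure_ascii=True`, so it re-escapes non-ASCII characters even when the input file was clean UTF-8 (Python 3.9 later added a `--no-ensure-ascii` flag). A tiny replacement pretty-printer, as a sketch with hypothetical file names:

```python
import json

def pretty_print(src, dst):
    """Like `python -m json.tool`, but without re-escaping non-ASCII."""
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(data, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write("\n")

# Usage: pretty_print("in.json", "out.json")
```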

Calaf
  • Scrapy 1.2 (yet to be released) will have a `FEED_EXPORT_ENCODING` setting option todo that: see https://github.com/scrapy/scrapy/pull/2034 . In the meantime, you can use the master branch of scrapy – paul trmbrth Sep 19 '16 at 07:48
  • There's this implementation also: https://github.com/scrapy/scrapy/issues/1963#issuecomment-215797219 – paul trmbrth Sep 19 '16 at 07:49

2 Answers

>>> a = [{
...     "name": "D\u00fcsseldorf",
...     "url": "D\u00fcsseldorf.html"
... }]
>>> a
[{'url': 'Düsseldorf.html', 'name': 'Düsseldorf'}]
>>> json.dumps(a, ensure_ascii=False)
'[{"url": "Düsseldorf.html", "name": "Düsseldorf"}]'
Asish M.
  • That seems indeed to be the way, and it's already incorporated in the item exporter, as you see in the question. Any suggestions why the output remains a mixture of \u-encoded and percent-encoded strings? – Calaf Sep 18 '16 at 06:16
  • I would consider including `urllib.parse.unquote` somewhere to convert the percent-encoded strings. Could you give an example where there's both \u-encoded and percent-encoded strings? – Asish M. Sep 18 '16 at 06:27
  • I've tried using `urllib.parse.unquote` in `export_item` of the item exporter. The trouble with this framework is that there are so many layers of code and hooks it's not very clear how to get something that is otherwise straightforward working. – Calaf Sep 18 '16 at 13:32

This seems to work for me:

# -*- coding: utf-8 -*-
import scrapy
import urllib

class SimpleItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

class CitiesSpider(scrapy.Spider):
    name = "cities"
    allowed_domains = ["sistercity.info"]
    start_urls = (
        'http://en.sistercity.info/countries/de.html',
    )

    def parse(self, response):
        for a in response.css('a'):
            item = SimpleItem()
            item['name'] = a.css('::text').extract_first()
            item['url'] = urllib.unquote(
                a.css('::attr(href)').extract_first().encode('ascii')
                ).decode('utf8')
            yield item
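(Note the spider above is Python 2: `urllib.unquote` expects bytes, hence the `.encode('ascii')` round-trip. In Python 3 the equivalent is `urllib.parse.unquote`, which accepts `str` directly and needs no such step; a sketch:)

```python
from urllib.parse import unquote

# Python 3: unquote() accepts str directly; no ascii-encode round-trip needed.
print(unquote("D%C3%BCsseldorf.html"))  # Düsseldorf.html

# Input with no %-sequences passes through unchanged:
print(unquote("Düsseldorf.html"))  # Düsseldorf.html
```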

Using the feed exporter cited in your question, it also worked with another storage backend:

# -*- coding: utf-8 -*-
import json
import io
import os
from scrapy.contrib.exporter import BaseItemExporter
from w3lib.url import file_uri_to_path

class CustomFileFeedStorage(object):

    def __init__(self, uri):
        self.path = file_uri_to_path(uri)

    def open(self, spider):
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return io.open(self.path, mode='ab')

    def store(self, file):
        file.close()

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

along with these settings (uncomment the FEED_STORAGES block to enable the custom storage):

FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter'
}
#FEED_STORAGES = {
#   '': 'myproj.exporter.CustomFileFeedStorage'
#}
FEED_FORMAT = 'json'
FEED_URI = "out.json"
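Note that this exporter emits JSON Lines: one standalone JSON object per line, not a JSON array. Reading the feed back therefore means parsing line by line rather than one `json.loads()` over the whole file; a sketch with made-up two-line content:

```python
import json

# The exporter writes one JSON object per line (JSON Lines):
lines = ('{"name": "Düsseldorf", "url": "Düsseldorf.html"}\n'
         '{"name": "Köln", "url": "Köln.html"}\n')

# Parse each non-empty line separately:
items = [json.loads(line) for line in lines.splitlines() if line.strip()]
print(items[0]["name"])  # Düsseldorf
```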
Wilfredo
  • The line `a.css('::attr(href)').extract_first().encode('ascii')` (first solution) gives me `UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)`. – Calaf Sep 19 '16 at 16:52
  • The custom feed `CustomFileFeedStorage` does not appear in the `FEED_STORAGES` hook. Did you hook it in at all? – Calaf Sep 19 '16 at 17:12
  • Yes, just remove the commented section, please, could you add a sample url to be able to replicate your error? – Wilfredo Sep 19 '16 at 19:22
  • For maximum ease of reproducibility, I was working with a local file (one that I access through `file:///path/to/file.html`). That's how I started the question. Now that works fine, though not without struggling with envvar settings (`PYTHONIOENCODING`, `LC_ALL`, `LANG` require having a setting other than the default, it seems). When I use a `http://domain` URL, the nearly identical set of files fail. Some mystery is lurking. It's as if the entire toolchain available (on a very recent OS X/MacPorts installation) is contaminated with parts that are not unicode-aware. – Calaf Sep 19 '16 at 19:35
  • funny.. I visually just skipped the lines you commented out in your code. Now I see why you kept the code for `CustomFileFeedStorage`. – Calaf Sep 19 '16 at 19:49
  • Thanks.. I'm getting closer. One issue remains. Do you actually get a list in `out.json` or just the dicts? In other words, the json output should be `[ {...} {...} ... {...} ]`, but I get instead `{...} {...} ... {...}`. – Calaf Sep 19 '16 at 20:03
  • You were solving a harder problem than the one I was asking. You were solving the problem starting from http:// while I was trying to focus on file:///. I've now asked, with every possible detail, the question for http:// . Please see http://stackoverflow.com/q/39582409/704972 . Your solution is the one I labeled "attempt 1". I've also tried (as "attempt 2") paul trmbrth's solution, given in his comment above. – Calaf Sep 19 '16 at 21:31