When I parse the file
<html>
<head><meta charset="UTF-8"></head>
<body><a href="Düsseldorf.html">Düsseldorf</a></body>
</html>
using
item = SimpleItem()
item['name'] = response.xpath('//a/text()')[0].extract()
item["url"] = response.xpath('//a/@href')[0].extract()
return item
I end up with either \u escapes
[{
"name": "D\u00fcsseldorf",
"url": "D\u00fcsseldorf.html"
}]
or with percent-encoded strings
D%C3%BCsseldorf
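To make the two forms concrete, here is a minimal Python 3 sketch of how the standard library produces each of them from the same string. The variable names are only for illustration, and this is not meant to show where Scrapy itself does the encoding.

```python
import json
from urllib.parse import quote

name = "Düsseldorf"

# Default json.dumps escapes every non-ASCII character as \uXXXX.
escaped = json.dumps({"name": name})

# ensure_ascii=False keeps the character as-is in the resulting string.
literal = json.dumps({"name": name}, ensure_ascii=False)

# Percent-encoding works on the UTF-8 bytes of the string (ü is C3 BC).
percent = quote(name + ".html")

print(escaped)
print(literal)
print(percent)
```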
The item exporter described here
# -*- coding: utf-8 -*-
import json
from scrapy.contrib.exporter import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')
along with the appropriate feed exporter setting
FEED_EXPORTERS = {
    'json': 'myproj.exporter.UnicodeJsonLinesItemExporter',
}
does not help.
How do I get a utf-8-encoded JSON output?
I'm reiterating/expanding an unanswered question.
Update:
Orthogonal to Scrapy, note that without setting
export PYTHONIOENCODING="utf_8"
running
> echo { \"name\": \"Düsseldorf\", \"url\": \"Düsseldorf.html\" } > dorf.json
> python -c'import fileinput, json;print json.dumps(json.loads("".join(fileinput.input())),sort_keys=True, indent=4, ensure_ascii=False)' dorf.json > dorf_pp.json
will fail with
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 16: ordinal not in range(128)
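The mechanism behind that traceback can be reproduced without any shell redirection. The following is a small Python 3 sketch, with an illustrative string and variable names of my own: encoding the text with the ascii codec fails on the ü, while the utf-8 codec turns it into two bytes.

```python
text = '{"name": "Düsseldorf"}'

# The ascii codec has no representation for ü, so encoding fails.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The utf-8 codec encodes ü as the two bytes C3 BC.
utf8_bytes = text.encode("utf-8")

print(type(error).__name__)
print(utf8_bytes)
```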
Update:
As posted, my question was unanswerable. The UnicodeJsonLinesItemExporter works, but another part of the pipeline was the culprit: as a post-processing step to pretty-print the JSON output, I was using python -m json.tool in.json > out.json, which escapes non-ASCII characters by default.
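One possible replacement for that pretty-printing step is a short script that sets ensure_ascii=False and writes the file with an explicit encoding. This is only a sketch, assuming Python 3; pretty_print_json is a hypothetical helper name, and in.json / out.json are the placeholder file names from above.

```python
import json


def pretty_print_json(src_path, dst_path):
    """Re-indent a JSON file, keeping non-ASCII characters unescaped."""
    # Read the input as UTF-8 rather than relying on the locale default.
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)

    # ensure_ascii=False stops the encoder from emitting \uXXXX escapes.
    text = json.dumps(data, sort_keys=True, indent=4, ensure_ascii=False)

    # Write UTF-8 explicitly so the result does not depend on
    # PYTHONIOENCODING or on how stdout is redirected.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text + "\n")
```

Calling pretty_print_json("in.json", "out.json") would then stand in for the json.tool invocation.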