This text file (only 30 bytes; its content is '(Ne pas r\xe9pondre a ce message)') can be read and stored in a dict successfully:

import json

d = {}

with open('temp.txt', 'r') as f:
    d['blah'] = f.read()

with open('test.txt', 'w') as f:
    data = json.dumps(d)
    f.write(data)

But dumping the dict to a JSON file fails (see traceback below). Why?

I tried many solutions from various SO questions. The closest I got was this answer. With it, I can dump to a file, but the JSON file then looks like this:

# test.txt
{"blah": "(Ne pas r\u00e9pondre a ce message)"}

instead of

# test.txt
{"blah": "(Ne pas répondre a ce message)"}

Traceback:

  File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte
[Finished in 0.1s with exit code 1]
Basj

1 Answer

Your file is not UTF-8 encoded. It uses a Latin codec, such as ISO-8859-1 or Windows codepage 1252. Reading the file in Python 2 gives you the still-encoded bytes as a str, not Unicode text.
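A quick way to check this diagnosis is to try decoding the raw bytes both ways. The sketch below uses the 30-byte content quoted in the question; the variable names are illustrative only, and the code is written so that it should run on Python 3 as well as Python 2.

```python
# Raw bytes as stored in the file (content taken from the question).
raw = b'(Ne pas r\xe9pondre a ce message)'

# Decoding as UTF-8 fails: 0xe9 announces a multi-byte sequence,
# but the byte that follows ('p') is not a continuation byte.
try:
    raw.decode('utf-8')
    utf8_error_position = None
except UnicodeDecodeError as exc:
    utf8_error_position = exc.start  # index of the offending byte

# Decoding as Latin-1 succeeds, because every byte maps to a codepoint.
text = raw.decode('latin-1')
```

Note that Latin-1 decoding never raises, so success alone does not prove the file is Latin-1; it only shows that the text comes out as expected for this input.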

JSON, however, requires Unicode text. Python 2 has a separate unicode type, and byte strings (type str) need to be decoded using a suitable codec. By default, json.dumps() assumes that byte strings are UTF-8 encoded; UTF-8 is a widely used codec that can encode every codepoint in the Unicode standard, and it is also the default encoding for JSON (JSON requires documents to be encoded in one of 3 UTF codecs).

You need to either decode the string manually or tell json.dumps() what codec to use for the byte string:

data = json.dumps(d, encoding='latin1')  # applies to all bytestrings in d

or

d['blah'] = d['blah'].decode('latin1')
data = json.dumps(d)

or using io.open() to decode as you read:

import io

# open the source file for reading; io.open() decodes Latin-1 bytes to unicode
with io.open('temp.txt', 'r', encoding='latin1') as f:
    d['blah'] = f.read()
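Putting the io.open() variant together with the dump, a minimal end-to-end sketch might look like this. It first writes the sample bytes to a temporary directory so that it is self-contained; the temporary path and the `src` name are assumptions for illustration, not part of the question's setup.

```python
import io
import json
import os
import tempfile

# Recreate the Latin-1 encoded input file in a scratch directory (assumed path).
src = os.path.join(tempfile.mkdtemp(), 'temp.txt')
with open(src, 'wb') as f:
    f.write(b'(Ne pas r\xe9pondre a ce message)')

d = {}

# Decode while reading, so the dict holds Unicode text rather than raw bytes.
with io.open(src, 'r', encoding='latin1') as f:
    d['blah'] = f.read()

# With Unicode text in the dict, the dump no longer needs to guess a codec.
data = json.dumps(d)
```

Here `data` contains the ASCII-safe form, with the accented character written as a `\u00e9` escape.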

By default, the json library produces ASCII-safe JSON output by using the \uhhhh escape syntax that the JSON standard allows. This is entirely normal: the output is valid JSON and readable by any compliant JSON decoder.
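To see that the escaped form loses nothing, a small round-trip check can help. This is only a sketch with illustrative variable names; it uses the sample text from the question.

```python
import json

original = {u'blah': u'(Ne pas r\xe9pondre a ce message)'}

# Default settings: the output is pure ASCII, with a \u00e9 escape.
encoded = json.dumps(original)

# Any compliant decoder turns the escape back into the original character.
decoded = json.loads(encoded)
```

After this runs, `decoded` compares equal to `original`, even though `encoded` contains no non-ASCII characters.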

If you must produce UTF-8 encoded output without the \uhhhh escape sequences, see Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence (https://stackoverflow.com/q/18337407).
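As a rough sketch of what that can look like, the snippet below disables the ASCII escaping and writes the result as UTF-8. It assumes Python 3 behaviour, where json.dumps() returns text; on Python 2 the return type of json.dumps(..., ensure_ascii=False) depends on the input, so treat this as illustrative. The output path is a temporary, made-up location.

```python
import io
import json
import os
import tempfile

d = {u'blah': u'(Ne pas r\xe9pondre a ce message)'}
out_path = os.path.join(tempfile.mkdtemp(), 'test.txt')  # assumed scratch path

# ensure_ascii=False keeps non-ASCII characters instead of escaping them.
text = json.dumps(d, ensure_ascii=False)

# Encode explicitly as UTF-8 when writing to disk.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to inspect what actually landed in the file.
with open(out_path, 'rb') as f:
    written = f.read()
```

The file then contains the accented character as the two UTF-8 bytes 0xc3 0xa9, so a text editor configured for UTF-8 should display and search it as plain text.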

Martijn Pieters
  • Thanks, but when using this, the output file looks like this : `{"blah": "(Ne pas r\u00e9pondre a ce message)"}` – Basj Apr 03 '15 at 10:59
  • @Basj: yes, because that's the proper JSON encoding of the data. See [Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence](https://stackoverflow.com/q/18337407) if you want to have the codepoint encoded to UTF-8 instead of using an escape sequence. – Martijn Pieters Apr 03 '15 at 10:59
  • Arghh, this is annoying... I wanted to use JSON to have both 1) something that can be easily loaded into a dict in python 2) human readable in a Text editor... With all the `\u00*`, it won't be human readable anymore in a text editor. I won't be able to "search" for patterns like "répondre" with the standard search tool. – Basj Apr 03 '15 at 11:01
  • I already tried the last link that you gave https://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence in the past 2 hours ;) but unsuccessfully. How would it work here? – Basj Apr 03 '15 at 11:03
  • @Basj: the problem then is that your text editor needs to be consistently configured to handle UTF-8. Note also that the Unicode standard allows for codepoints to be *decomposed* into component parts; an `e` followed by a combining acute accent would display the same but would not necessarily be findable when searching for the `é` composed codepoint. – Martijn Pieters Apr 03 '15 at 11:04
  • @Basj: `json.dumps(d, encoding='latin1', ensure_ascii=False).encode('utf8')` produces UTF-8 output. – Martijn Pieters Apr 03 '15 at 11:04