
I have a dictionary `data` in which I have stored:

  • key - the ID of an event

  • value - the name of that event, as a UTF-8 string

Now, I want to write this map out to a JSON file. I tried this:

import json

with open('events_map.json', 'w') as out_file:
    json.dump(data, out_file, indent=4)

but this gives me the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

Now, I also tried with:

with io.open('events_map.json', 'w', encoding='utf-8') as out_file:
    out_file.write(unicode(json.dumps(data, encoding="utf-8")))

but this raises the same error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

I also tried with:

with io.open('events_map.json', 'w', encoding='utf-8') as out_file:
    out_file.write(unicode(json.dumps(data, encoding="utf-8", ensure_ascii=False)))

but this raises the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position 3114: ordinal not in range(128)

Any suggestions on how I can solve this problem?

EDIT: I believe this is the entry that is causing the problem:

>>> data['142']
'\xbf/ANCT25'
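Indeed, that byte cannot be decoded as UTF-8 at all (a quick check, using a byte literal to stand in for the stored value):

```python
# 0xbf is a UTF-8 continuation byte, so it can never start a character
value = b'\xbf/ANCT25'
try:
    value.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)
```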

EDIT 2: The `data` variable is read from a file:

data_file_lines = io.open(file_name, 'r', encoding='utf8').readlines()

I then do:

with io.open('data/events_map.json', 'w', encoding='utf8') as json_file:
    json.dump(data, json_file, ensure_ascii=False)

Which gives me the error:

TypeError: must be unicode, not str

Then, I try building the `data` dictionary like this (it is initialized from a list of tuples, `sorted_tuples`):

for t in sorted_tuples:
    data[str(t[1])] = json.dumps(t[0], ensure_ascii=False, encoding='utf8')

which is, again, followed by:

with io.open('data/events_map.json', 'w', encoding='utf8') as json_file:
    json.dump(data, json_file, ensure_ascii=False)

but again, the same error:

TypeError: must be unicode, not str

I get the same error when I use the simple open function for reading from the file:

data_file_lines = open(file_name, "r").readlines()
Belphegor
  • The string in your `data` dictionary is not actually UTF-8 encoded; decoding it to Unicode fails. – Martijn Pieters Aug 04 '14 at 15:47
  • Can you please put the actual `data` dictionary in your post? Just include the output of `print data`. – Martijn Pieters Aug 04 '14 at 15:50
  • The `data` variable is too big to paste it. Anyway, I think only one entry on my dictionary is causing the problem. I edited my post. – Belphegor Aug 04 '14 at 15:53
  • That string is indeed *not* UTF-8 encoded. Is that supposed to be an [inverted question mark](http://codepoints.net/U+00BF), perhaps? – Martijn Pieters Aug 04 '14 at 15:54
  • You'll have to either replace that value with an actual UTF-8 encoded value, or replace it with a Unicode value (so explicitly decode it first before passing it to `json.dump()`). – Martijn Pieters Aug 04 '14 at 15:56

2 Answers


The exception is caused by the contents of your `data` dictionary: at least one of the keys or values is not UTF-8 encoded.

You'll have to replace that value, either by substituting a value that actually is UTF-8 encoded, or by decoding just that one value to a `unicode` object using whatever encoding is correct for it:

data['142'] = data['142'].decode('latin-1')

to decode that string as a Latin-1-encoded value instead.
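Putting it together, a minimal sketch, assuming the stray value really is Latin-1 encoded (the byte value is taken from the question; the file name is illustrative):

```python
# -*- coding: utf-8 -*-
import io
import json

data = {'142': b'\xbf/ANCT25'}               # the offending non-UTF-8 byte string
data['142'] = data['142'].decode('latin-1')  # now a proper text (unicode) value

with io.open('events_map.json', 'w', encoding='utf-8') as out_file:
    # on Python 2, wrap the dumps() result in unicode(...) to be safe
    out_file.write(json.dumps(data, ensure_ascii=False, indent=4))
```

With `ensure_ascii=False` the non-ASCII character is written out literally rather than as a `\uxxxx` escape sequence.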

Martijn Pieters
  • I read these values from a file. You were correct about the inverted question mark, so I changed that value to another UTF-8 character (the letter "é"). With your solution `data['142'].decode('latin-1')` it doesn't raise any errors, but in the final JSON file I have "142": "\u00e9ANCT25" instead of the expected "142": "éANCT25". I tried to read the file with `codecs.open(file_name, "r", "utf-8")`, but then I get: `UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 2526468: invalid continuation byte`. How do I solve this problem so the real characters are written into the JSON? – Belphegor Aug 04 '14 at 16:26
  • 1
    `\u00e9` is a valid JSON escape sequence; do you absolutely *have* to have the Unicode character instead of the JSON `\uxxxx` escape sequence? – Martijn Pieters Aug 04 '14 at 16:29
  • @Belphegor: see [Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence](http://stackoverflow.com/a/18337754) for how to produce such data. – Martijn Pieters Aug 04 '14 at 16:30
  • Thanks for the help, but this didn't help me. It still doesn't work. I edited my question where I describe what else I've tried (in "Edit 2"). Any other suggestion? – Belphegor Aug 05 '14 at 07:45
  • 2
    Never mind, I've solved it finally! I got the answer from here: http://stackoverflow.com/questions/12309269/write-json-data-to-file-in-python/14870531#14870531 (the code for Python 2.x). Anyway, @Martijn Pieters , I wouldn't have done it without you, so I am accepting your answer. But, please add the answer from the link I've provided in your answer, so it would be clearer if someone else bumps into the same problem. Cheers! – Belphegor Aug 05 '14 at 08:01
  • FYI: I already edited your answer with the final version of my code, but I don't know if it's going to be approved by the moderators. Anyway, tnx for the help! – Belphegor Aug 05 '14 at 08:42
  • Thanks, that answer at http://stackoverflow.com/questions/12309269/write-json-data-to-file-in-python/14870531#14870531 worked for me too! – Blairg23 Dec 22 '15 at 18:45

I encountered the same error while opening a simple text file. What worked for me was switching the open encoding to "latin1".
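For example (a sketch with a made-up file name; note that Latin-1 maps every byte to a character, so this never raises, but it only produces correct text if the file really is Latin-1 encoded):

```python
import io

# write a byte (0xbf) that is invalid as the start of a UTF-8 sequence
with io.open('notes.txt', 'w', encoding='latin1') as f:
    f.write(u'\xbf/ANCT25')

# reading with encoding='latin1' decodes every byte without error,
# whereas encoding='utf-8' would raise UnicodeDecodeError here
with io.open('notes.txt', 'r', encoding='latin1') as f:
    lines = f.readlines()
```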

Schroeder