11

I have a JSON file that contains strings encoded like this:

"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1",

I am trying to parse this file using the json module. However I am not able to decode this string correctly.

What I get after decoding the JSON using the .load() method is 'HornÃ\xadkovÃ¡'. The string should be correctly decoded as 'Horníková' instead.

I read the JSON specification and I understand that \u should be followed by 4 hexadecimal digits specifying the Unicode code point of the character. But it seems that in this JSON file, UTF-8 encoded bytes are stored as \u sequences.

What type of encoding is this, and how do I parse it correctly in Python 3?

Is this kind of JSON file even valid JSON according to the specification?

Matej Kormuth

4 Answers

10

Your text is already encoded. Normally you would tell Python this with a b prefix on the string literal, but because you are using json and its input has to be a string, you have to decode the encoded text manually. Since your input is not bytes, you can use the 'raw_unicode_escape' codec to convert the string to bytes without re-encoding it, and pass the same codec to open() so that it does not apply its own default encoding. The resulting bytes can then be decoded as UTF-8 to get the desired result.

Note that because this encoding and decoding has to be done on the file's content, you have to read the file into a string first and then use json.loads() instead of json.load().

In [168]: with open('test.json', encoding='raw_unicode_escape') as f:
     ...:     d = json.loads(f.read().encode('raw_unicode_escape').decode())
     ...:     

In [169]: d
Out[169]: {'sender_name': 'Horníková'}
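
The session above depends on a test.json already being on disk. As a rough, self-contained sketch of the same three steps, the snippet below first writes the question's sample line to a temporary file (the temporary directory and the variable names are assumptions made for illustration) and then reads it back:

```python
import json
import os
import tempfile

# The bytes as they appear in the question's file: plain ASCII with \u escapes.
raw = rb'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.json")  # illustrative location only
    with open(path, "wb") as f:
        f.write(raw)

    # Step 1: raw_unicode_escape turns each \u00XX escape into the character U+00XX.
    with open(path, encoding="raw_unicode_escape") as f:
        text = f.read()

# Step 2: encoding with the same codec maps U+0000..U+00FF back to single bytes,
# which recovers the UTF-8 bytes that were hidden behind the escapes.
utf8_bytes = text.encode("raw_unicode_escape")

# Step 3: decode those bytes as UTF-8, and only then parse the JSON.
d = json.loads(utf8_bytes.decode("utf-8"))
print(d)
```

One caveat with this approach: step 1 also expands escapes such as \u0022 (a double quote) before the JSON parser sees them, so a file that contains escaped quotes or backslashes in that form would probably need a different strategy.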
Mazdak
  • Is it possible to do this incrementally (i.e. without reading the whole file first, encode it and then pass it to the JSON parser)? – Samuele Pilleri Nov 10 '18 at 15:17
  • @SamuelePilleri That's in contradiction with the JSON file's nature. However, there are a couple of efforts out there to make that possible in some cases. Python doesn't support that by default. – Mazdak Nov 10 '18 at 19:53
  • Not 100% sure why (isn't it a recursive descent parser?), anyway what I actually meant is if there is a way to progressively encode/decode the stream without loading it in memory first. – Samuele Pilleri Nov 10 '18 at 21:56
  • Please see [this question](https://stackoverflow.com/questions/53242216/load-a-json-with-raw-unicode-escape-encoded-strings) – Samuele Pilleri Nov 11 '18 at 00:14
7

The JSON you are reading was written incorrectly. The Unicode strings decoded from it have to be re-encoded with the wrong encoding that was used (Latin-1 here), then decoded with the correct encoding (UTF-8).

Here's an example:

#!python3
import json

# The bad JSON you have
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}'
print('bad_json =',bad_json)

# The wanted result from json.loads()
wanted = {'sender_name':'Horníková'}

# What correctly written JSON should look like
good_json = json.dumps(wanted)
print('good_json =',good_json)

# What you get when loading the bad JSON.
got = json.loads(bad_json)
print('wanted =',wanted)
print('got =',got)

# How to correct the mojibake string
corrected_sender = got['sender_name'].encode('latin1').decode('utf8')
print('corrected_sender =',corrected_sender)

Output:

bad_json = {"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1"}
good_json = {"sender_name": "Horn\u00edkov\u00e1"}
wanted = {'sender_name': 'Horníková'}
got = {'sender_name': 'HornÃ\xadkovÃ¡'}
corrected_sender = Horníková
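
The example corrects a single field. If every string in the file is affected, the same encode/decode round trip could be applied recursively to the parsed structure. The following is only a sketch: fix_mojibake is a hypothetical helper name, the extra "tags" field is made up for the demonstration, and leaving a string unchanged when the round trip fails is an assumption about what is wanted.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings that are UTF-8 bytes misread as Latin-1."""
    if isinstance(value, str):
        try:
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not this kind of mojibake (or not repairable): leave it unchanged.
            return value
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    # Numbers, booleans and None pass through untouched.
    return value


# Hypothetical input: the question's field plus an invented "tags" list.
bad_json = r'{"sender_name": "Horn\u00c3\u00adkov\u00c3\u00a1", "tags": ["caf\u00c3\u00a9", 1]}'
fixed = fix_mojibake(json.loads(bad_json))
print(fixed)
```

A string that was written correctly in the first place, such as 'Horníková', normally fails the UTF-8 decode and is returned as-is, but a rare string that happens to be valid both ways would be altered, so this is best applied only to files known to be damaged.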
Mark Tolonen
3

I don't know enough about JSON to be able to say whether this is valid or not, but you can parse these strings using the raw_unicode_escape codec:

>>> "Horn\u00c3\u00adkov\u00c3\u00a1".encode('raw_unicode_escape').decode('utf8')
'Horníková'
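
For this particular string, raw_unicode_escape and latin-1 produce the same bytes; the two codecs only differ for code points above U+00FF. A small sketch of that difference follows (the euro sign and the "café" string are made-up test inputs, not data from the question):

```python
mojibake = "Horn\u00c3\u00adkov\u00c3\u00a1"

# For code points up to U+00FF both codecs produce identical bytes.
same_bytes = mojibake.encode("raw_unicode_escape") == mojibake.encode("latin-1")
print(same_bytes)

# Above U+00FF they differ: raw_unicode_escape writes a literal \uXXXX escape...
euro = "\u20ac"
escaped = euro.encode("raw_unicode_escape")
print(escaped)

# ...while latin-1 cannot represent the character and raises an error.
try:
    euro.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# Consequence: in a mixed string the non-Latin-1 character comes back
# as backslash-u text rather than as the character itself.
mixed = ("caf\u00c3\u00a9 " + euro).encode("raw_unicode_escape").decode("utf-8")
print(mixed)
```

So raw_unicode_escape never fails on such input, but any character above U+00FF ends up as a literal escape sequence in the result unless something (for example a JSON parser) interprets it afterwards.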
dorian
3

Reencode to bytes, and then redecode to text.

>>> 'HornÃ\xadkovÃ¡'.encode('latin-1').decode('utf-8')
'Horníková'

Is this kind of JSON file even valid JSON according to the specification?

No.

A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes [emphasis added].

source

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). [...] Any code point may be represented as a hexadecimal escape sequence [...] represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point [emphasis added].

source

UTF-8 byte sequences are neither Unicode characters nor Unicode code points.
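
To make the distinction concrete, here is a short sketch contrasting an escape that names a code point with escapes that name UTF-8 bytes. How the file in the question was actually produced is not known; writing one \u00XX escape per byte is just one plausible way to end up with it, and the variable names are illustrative.

```python
import json

# A proper JSON escape names a code point: U+00ED is 'í'.
proper = json.loads(r'"\u00ed"')
print(proper == "\u00ed")

# The UTF-8 encoding of that code point is two bytes, C3 AD.
utf8 = "\u00ed".encode("utf-8")
print(utf8.hex())

# Writing each byte as its own \u00XX escape gives the form seen in the question.
as_escapes = "".join(f"\\u{b:04x}" for b in utf8)
print(as_escapes)

# A JSON parser reads that as two code points, U+00C3 and U+00AD, not as 'í'.
decoded = json.loads(f'"{as_escapes}"')
print(len(decoded))
```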

Ignacio Vazquez-Abrams