You have Mojibake data here: UTF-8 bytes that were decoded with the wrong codec.
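You can reproduce that kind of damage by taking the wrong turn deliberately; decoding the UTF-8 bytes for é as CP1252 produces the telltale Ã© from your first sample:

>>> print u'd\xe9cor'.encode('utf8').decode('cp1252')
dÃ©cor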
The trick is to figure out which encoding was used to decode the data before the JSON output was produced. The first two samples can be repaired if you assume the encoding was Windows Codepage 1252:
>>> sample = u'''\
... d\u00c3\u00a9cor
... business\u00e2\u20ac\u2122 active accounts
... the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
... '''.splitlines()
>>> print sample[0].encode('cp1252').decode('utf8')
décor
>>> print sample[1].encode('cp1252').decode('utf8')
business’ active accounts
but this codec fails for the third sample:
>>> print sample[2].encode('cp1252').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 24: character maps to <undefined>
The first 3 'weird' characters are certainly a CP1252 Mojibake of the U+201C LEFT DOUBLE QUOTATION MARK codepoint:
>>> sample[2]
u'the \xe2\u20ac\u0153Made in the USA\xe2\u20ac\x9d label'
>>> sample[2][:22].encode('cp1252').decode('utf8')
u'the \u201cMade in the USA'
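Going the other way confirms it; the three UTF-8 bytes for U+201C decode, via CP1252, to exactly those three 'weird' characters:

>>> u'\u201c'.encode('utf8')
'\xe2\x80\x9c'
>>> '\xe2\x80\x9c'.decode('cp1252')
u'\xe2\u20ac\u0153'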
so the other combo is presumably meant to be U+201D RIGHT DOUBLE QUOTATION MARK, but that character encodes to a UTF-8 byte for which CP1252 has no mapping:
>>> u'\u201d'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
That's because the CP1252 codec has no character at position hex 9D, yet the U+009D codepoint did make it into the JSON output:
>>> sample[2][22:]
u'\xe2\u20ac\x9d label'
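You can poke at the hole in the codec's own decoding table to confirm; if I recall the CPython charmap codecs correctly, undefined slots are marked with U+FFFE:

>>> import encodings.cp1252
>>> encodings.cp1252.decoding_table[0x9d]  # the undefined slot the traceback complained about
u'\ufffe'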
The ftfy library Ned Batchelder so helpfully alerted me to uses a 'sloppy' CP1252 codec to work around that issue, mapping otherwise unmappable bytes one-on-one (UTF-8 byte to the Latin-1 Unicode codepoint with the same value). The resulting 'fancy quotes' are then mapped to ASCII quotes by the library, but you can switch that off:
>>> import ftfy
>>> ftfy.fix_text(sample[2])
u'the "Made in the USA" label'
>>> ftfy.fix_text(sample[2], uncurl_quotes=False)
u'the \u201cMade in the USA\u201d label'
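Applied to all three samples, that should reproduce the repaired text from above:

>>> for line in sample:
...     print ftfy.fix_text(line, uncurl_quotes=False)
...
dÃ©cor
business’ active accounts
the “Made in the USA” label

Wait, no, that first one should read décor; like so:

>>> print ftfy.fix_text(sample[0])
décor

And if I remember the ftfy internals correctly, importing ftfy.bad_codecs registers the sloppy codecs with Python directly, so you can spell out the round-trip by hand:

>>> import ftfy.bad_codecs  # adds 'sloppy-windows-1252' and friends to the codec registry
>>> print sample[2].encode('sloppy-windows-1252').decode('utf8')
the “Made in the USA” label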
Since this library automates the task for you, and does a better job than the standard Python codecs can manage here, you should just install it and apply it to the mess this API hands you. Don't hesitate to berate the people who hand you this data, however, if you get half a chance. They have produced one lovely muck-up.