You have Mojibake data here: UTF-8 bytes that were decoded with the wrong codec.
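You can reproduce that kind of damage by taking the wrong turn deliberately; decoding the UTF-8 bytes for é as CP1252 produces the telltale Ã© from your first sample:

>>> print u'd\xe9cor'.encode('utf8').decode('cp1252')
dÃ©cor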
The trick is to figure out which encoding was used to decode the data before the JSON output was produced. The first two samples can be repaired if you assume the encoding was Windows Codepage 1252:
>>> sample = u'''\
... d\u00c3\u00a9cor
... business\u00e2\u20ac\u2122 active accounts
... the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
... '''.splitlines()
>>> print sample[0].encode('cp1252').decode('utf8')
décor
>>> print sample[1].encode('cp1252').decode('utf8')
business’ active accounts
but this codec fails for the third sample:
>>> print sample[2].encode('cp1252').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 24: character maps to <undefined>
The first 3 'weird' characters are certainly a CP1252 Mojibake of the U+201C LEFT DOUBLE QUOTATION MARK codepoint:
>>> sample[2]
u'the \xe2\u20ac\u0153Made in the USA\xe2\u20ac\x9d label'
>>> sample[2][:22].encode('cp1252').decode('utf8')
u'the \u201cMade in the USA'
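Going the other way confirms it; the three UTF-8 bytes for U+201C decode, via CP1252, to exactly those three 'weird' characters:

>>> u'\u201c'.encode('utf8')
'\xe2\x80\x9c'
>>> '\xe2\x80\x9c'.decode('cp1252')
u'\xe2\u20ac\u0153'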
so the other combo is presumably meant to be U+201D RIGHT DOUBLE QUOTATION MARK, but that character encodes to a UTF-8 byte for which CP1252 has no mapping:
>>> u'\u201d'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>
That's because the CP1252 codec has no character at position hex 9D, yet the U+009D codepoint did make it into the JSON output:
>>> sample[2][22:]
u'\xe2\u20ac\x9d label'
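You can poke at the hole in the codec's own decoding table to confirm; if I recall the CPython charmap codecs correctly, undefined slots are marked with U+FFFE:

>>> import encodings.cp1252
>>> encodings.cp1252.decoding_table[0x9d]  # the undefined slot the traceback complained about
u'\ufffe'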
The ftfy library Ned Batchelder so helpfully alerted me to uses a 'sloppy' CP1252 codec to work around that issue, mapping otherwise unmappable bytes one-on-one (UTF-8 byte to the Latin-1 Unicode codepoint with the same value). The resulting 'fancy quotes' are then mapped to ASCII quotes by the library, but you can switch that off:
>>> import ftfy
>>> ftfy.fix_text(sample[2])
u'the "Made in the USA" label'
>>> ftfy.fix_text(sample[2], uncurl_quotes=False)
u'the \u201cMade in the USA\u201d label'
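Applied to all three samples, that should reproduce the repaired text from above:

>>> for line in sample:
...     print ftfy.fix_text(line, uncurl_quotes=False)
...
dÃ©cor
business’ active accounts
the “Made in the USA” label

Wait, no, that first one should read décor; like so:

>>> print ftfy.fix_text(sample[0])
décor

And if I remember the ftfy internals correctly, importing ftfy.bad_codecs registers the sloppy codecs with Python directly, so you can spell out the round-trip by hand:

>>> import ftfy.bad_codecs  # adds 'sloppy-windows-1252' and friends to the codec registry
>>> print sample[2].encode('sloppy-windows-1252').decode('utf8')
the “Made in the USA” label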
Since this library automates the task for you, and does a better job than the standard Python codecs can manage here, you should just install it and apply it to the mess this API hands you. Don't hesitate to berate the people who hand you this data, however, if you get half a chance. They have produced one lovely muck-up.