
I am trying to retrieve JSON from the MusicBrainz API service. The returned data for some songs contains Unicode escape sequences, and I am having trouble converting them to the characters they represent. What should I be doing here?

JSON:

{
    "status": "ok",
    "results": [{
        "recordings": [{
            "duration": 402,
            "tracks": [{
                "duration": 402,
                "position": 6,
                "medium": {
                    "release": {
                        "id": "dde6ecee-8e9b-4b46-8c28-0f8d659f83ac",
                        "title": "Tecno Fes, Volume 2"
                    },
                    "position": 1,
                    "track_count": 11
                },
                "artists": [{
                    "id": "57c1e5ea-e08f-413a-bcb1-f4e4b675bead",
                    "name": "Gigi D\u2019Agostino"
                }],
                "title": "You Spin Me Round"
            }],
            "id": "2e0a7bce-9e44-4a63-a789-e8c4d2a12af9"
        }, ....

Failed Code (example):

string = '\u0420\u043e\u0441\u0441\u0438\u044f'
print string.encode('utf-8')

I am running this from a command-line terminal on a Windows 7 machine with Python 2.7. This is the output I get:

C:\Python27>python junk.py
Gigi DGÇÖAgostino
Gigi D?Agostino
Gigi D\u2019Agostino

I am expecting the output to be Gigi D' Agostino

JonnyJD
Prem Minister
  • What is a "normal" character? – tkone Jan 15 '13 at 19:21
  • I'm not sure exactly what your question is here. I ran the JSON you gave through the standard JSON decoder and the one bit of non-ASCII came out correctly as "Gigi D’Agostino". The "Failed Code", you're just missing a character. If you write `string = u'\u0420\u043e\u0441\u0441\u0438\u044f'`, the variable is properly set to `Россия`. BTW, don't use "string" as a variable name; it can only end in tears. – Michael Lorton Jan 15 '13 at 19:27
  • Do you mean you need ASCII encoded strings? Why? What are you actually doing with this data? – Silas Ray Jan 15 '13 at 19:53
  • I am trying to get some metadata that will eventually be used to manage media, hence the need to encode it properly. The desired output is "Gigi D’Agostino", but none of the code and methods I have tried address this issue:
    `print u'Gigi D\u2019Agostino'.encode('utf-8')`
    `print u'Gigi D\u2019Agostino'.encode('iso-8859-15', 'replace')`
    `a = u'Gigi D\u2019Agostino'` `import re` `a = re.sub(r'[\x80-\xFF]+', lambda x: x.group(0).encode('latin1').decode('utf8'), a)` `print a.encode('utf8')`
    – Prem Minister Jan 15 '13 at 19:57
  • didn't you forget to specify encoding in script? `# -*- encoding: utf-8 -*-` – der_fenix Jan 15 '13 at 19:58
  • I tried all the methods below, but none succeeded:
    `# -*- encoding: utf-8 -*-`
    `print u'Gigi D\u2019Agostino'.encode('utf-8')`
    `print u'Gigi D\u2019Agostino'.encode('iso-8859-15', 'replace')`
    `out = 'Gigi D\u2019Agostino'` `out = out.replace( u'\u2018', u"'")` `out = out.replace( u'\u2019', u"'")` `out = out.replace( u'\u201c', u'"')` `out = out.replace( u'\u201d', u'"')` `out.encode('ascii')` `print out`
    `a = u'Gigi D\u2019Agostino'` `import re` `a = re.sub(r'[\x80-\xFF]+', lambda x: x.group(0).encode('latin1').decode('utf8'), a)` `print a.encode('utf8')`
    – Prem Minister Jan 15 '13 at 20:06
  • How did they "not succeed"? Did they print anything? Did they raise exceptions? What output do you expect? What platform are you running this on? Are you running this on the terminal/command line, IDLE, Eclipse, something else? – Silas Ray Jan 15 '13 at 20:13
  • I am running this from a command-line terminal on a Windows 7 machine with Python 2.7. This is the output I get: `C:\Python27>python junk.py` `Gigi DGÇÖAgostino` `Gigi D?Agostino` `Gigi D\u2019Agostino` – Prem Minister Jan 15 '13 at 20:18
  • I am expecting the output to be --Gigi D' Agostino-- – Prem Minister Jan 15 '13 at 20:20

3 Answers


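In Python 2, `\u` escapes are only interpreted inside unicode literals (u'...'); in a regular string they stay as a literal backslash followed by "u". To convert a regular string that contains such escapes to unicode, use str.decode('unicode-escape'):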

In [1]: s='\u0420\u043e\u0441\u0441\u0438\u044f'

In [2]: s
Out[2]: '\\u0420\\u043e\\u0441\\u0441\\u0438\\u044f'

In [3]: s.decode('unicode-escape')
Out[3]: u'\u0420\u043e\u0441\u0441\u0438\u044f'

In [4]: print s.decode('unicode-escape')
Россия

In [5]: s2="Gigi D\u2019Agostino"

In [6]: s2
Out[6]: 'Gigi D\\u2019Agostino'

In [7]: print s2.decode('unicode-escape')
Gigi D’Agostino
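
The same decoding step can be sketched in a form that should behave identically on Python 2.7 and Python 3, by going through the codecs module rather than the str.decode method (which Python 3 strings do not have). The variable names here are only illustrative:

```python
import codecs

# A byte string holding a literal backslash-u escape, like s2 in the session above.
raw = b'Gigi D\\u2019Agostino'

# 'unicode_escape' interprets the six bytes of \u2019 as one code point.
text = codecs.decode(raw, 'unicode_escape')

# 20 bytes in, 15 characters out: the escape collapsed into a single character.
lengths = (len(raw), len(text))
```

Note that this only applies to strings that still contain raw escapes; text that came out of a JSON parser is already decoded.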
root
  • Wonder if this has anything to do with my command-line terminal now: – Prem Minister Jan 15 '13 at 21:05
  • `s2 = 'Gigi D\\u2019Agostino'`
    `print s2.decode('unicode-escape')`

    `C:\Python27>python junk.py`
    `Traceback (most recent call last):`
    `  File "junk.py", line 2, in <module>`
    `    print s2.decode('unicode-escape')`
    `  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode`
    `    return codecs.charmap_encode(input,errors,encoding_map)`
    `UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 6: character maps to <undefined>`

    – Prem Minister Jan 15 '13 at 21:05
  • @PremMinister -- How are you running this? If you copy and paste it into an interactive Python interpreter, does it work? – root Jan 15 '13 at 21:23
  • `.decode('unicode-escape')` shouldn't be necessary for a json-text. – jfs Jan 16 '13 at 03:15
  • @J.F.Sebastian -- You are of course right. For some reason I was assuming that the problem persisted after using a parser, because "how else would he have gotten the string extracted?"... – root Jan 16 '13 at 06:13
  • `>>> s = "some\x00string. with\x15 funny characters"` `>>> import string` `>>> filter(lambda x: x in string.printable, s)` `'somestring. with funny characters'` – Prem Minister Jan 21 '13 at 09:09

You should use a JSON parser, which returns Unicode strings, as any valid JSON parser does. Your failing example shows a bytestring, i.e., you haven't used a JSON parser.

For example, to parse json data:

import json, urllib2
obj = json.load(urllib2.urlopen(request))  # request: a URL string or a urllib2.Request

To pretty print obj without using Unicode escapes:

print json.dumps(obj, indent=4, ensure_ascii=False)

It is also useful to understand the difference between:

print unicode_string

And:

print repr(unicode_string)
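
As a small illustration of both points, here is a sketch that parses a fragment shaped like the response in the question (trimmed by hand for this example, not actual service output) and compares the two json.dumps modes. It uses only the standard json module and should run on Python 2.7 and Python 3:

```python
import json

# Hand-trimmed fragment modelled on the response in the question.
payload = '{"artists": [{"name": "Gigi D\\u2019Agostino"}]}'

obj = json.loads(payload)
name = obj["artists"][0]["name"]

# The parser has already turned the \u2019 escape into one real character.
name_length = len(name)

# ensure_ascii=True (the default) writes non-ASCII back out as \uXXXX escapes.
escaped = json.dumps(name)

# ensure_ascii=False keeps the character itself in the output.
literal = json.dumps(name, ensure_ascii=False)
```

Seeing `\u2019` in output therefore usually means you are looking at a repr or at re-escaped JSON, not that the parsed string is wrong.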
jfs

Are you using cmd on Windows? In that case, getting Unicode to display correctly at all can take a bit of a hack. You might want to think about using another "terminal" to test your scripts. MSYS provides a nice terminal/shell, and IDLE is included in the Windows Python distribution and has a Python Shell (right click, open in IDLE, F5).
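
To see where the odd output in the question likely comes from, here is a minimal sketch that assumes the console code page is cp437 (the one named in the traceback posted in the comments). It reproduces the effect without printing anything, so it does not depend on the terminal it runs in:

```python
name = u"Gigi D\u2019Agostino"

# UTF-8 encodes U+2019 as three bytes (E2 80 99). A cp437 console shows each
# byte as its own character, which produces the three-character garbage.
utf8_seen_as_cp437 = name.encode("utf-8").decode("cp437")

# cp437 has no slot for U+2019, so a strict encode raises UnicodeEncodeError;
# with errors="replace" the character becomes "?" instead.
replaced = name.encode("cp437", "replace")
```

Under that assumption the first value corresponds to the "DGÇÖA" style line and the second to the "D?Agostino" line; with a different active code page the garbage characters will differ.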

If you really want to make it work in cmd:

First set Lucida Console as the font in cmd. Then:

> chcp
Active code page: 850
> chcp 65001

Then you should have Unicode output in cmd. Your "Active code page" might be different. Note the original value somewhere, because you might want to change it back afterwards:

> chcp 850

Otherwise you will run into other problems (starting .bat files doesn't work). (See also batch-file-encoding)

In your script you also need this:

import codecs

def cp65001(name):
    """This might be buggy, but better than just a LookupError
    """
    if name.lower() == "cp65001":
        return codecs.lookup("utf-8")

codecs.register(cp65001)

Otherwise Python will crash. (see windows-cmd-encoding-change-causes-python-crash)

I had a similar bug report for my script.
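
If changing the code page is not an option and an ASCII apostrophe is acceptable (the question mentions expecting a plain ' in the output), one fallback is to map typographic quotes to ASCII look-alikes before printing. This is only a sketch: `asciify_quotes` and `PUNCTUATION_MAP` are hypothetical names made up for this example, and the table covers just the four quote characters mentioned in the comments on the question. The input must be a unicode string (u'...' on Python 2):

```python
# Hypothetical lookup table: code point -> ASCII replacement.
PUNCTUATION_MAP = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def asciify_quotes(text):
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(PUNCTUATION_MAP)


plain = asciify_quotes(u"Gigi D\u2019Agostino")

# After the replacement this particular name is pure ASCII and encodes cleanly.
plain_bytes = plain.encode("ascii")
```

This changes the data, so it is better suited to display than to storing the metadata; names with other non-ASCII characters would still need a console that can show them.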


You might also consider using a library to access the MusicBrainz Web Service. Python-musicbrainzngs works with the current ws/2.

JonnyJD