
I need to percent-encode Japanese words so they match the encoding used for the word in a link. The problem is that when I encode them, the result is slightly off.

I need it to be:

%E5%A4%89%E4%BD%93

instead it's:

b'\xe5\xa4\x89\xe4\xbd\x93'

What can I do to get it to work?

Cœur
  • See the dupe target for a stdlib function that will do what you want for special characters. Or, alternatively, for *all* characters: `''.join(["%%%02X" % ord(c) for c in s])` – jedwards Mar 13 '15 at 20:22
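The one-liner in the comment above is written for a Python 2 byte string. As a rough Python 3 sketch of the same idea (the function name `percent_encode_all` is made up for illustration, and UTF-8 is assumed as the target encoding), every byte can be percent-encoded like this:

```python
def percent_encode_all(text, encoding="utf-8"):
    """Percent-encode every byte of text, including plain ASCII letters."""
    # Iterating over a bytes object in Python 3 yields integers,
    # so no ord() call is needed here.
    return "".join("%{:02X}".format(byte) for byte in text.encode(encoding))


print(percent_encode_all("変体"))  # %E5%A4%89%E4%BD%93
print(percent_encode_all("a"))     # %61
```

Unlike a standard URL-quoting function, this also escapes characters that are safe in URLs, which is usually unnecessary.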

1 Answer


The output starting with b isn't an encoding per se. It's just how Python represents the raw bytestring. If you type

print b'\xe5\xa4\x89\xe4\xbd\x93'

you actually get 変体 (in Python 2, on a UTF-8 terminal with a suitable font). The bytes are simply the UTF-8 encoding of the word.

>>> x=b'\xe5\xa4\x89\xe4\xbd\x93'
>>> y=u'変体'
>>> x.decode('utf-8') == y
True

But then again, if you do

>>> import urllib
>>> urllib.quote(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1285, in quote
    return ''.join(map(quoter, s))
KeyError: u'\u5909'

This fails because Python 2's urllib.quote cannot handle non-ASCII unicode strings. So you have to go back to the bytestring anyway:

>>> urllib.quote(y.encode('utf-8'))
'%E5%A4%89%E4%BD%93'
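The session above is Python 2. Since the question's output uses the `b'...'` repr, the asker may well be on Python 3, where the function lives in `urllib.parse` instead. A minimal sketch under that assumption (the variable names are illustrative):

```python
from urllib.parse import quote, unquote

word = "変体"

# quote() accepts a str and encodes it as UTF-8 by default.
encoded = quote(word)
print(encoded)  # %E5%A4%89%E4%BD%93

# It also accepts bytes that are already UTF-8 encoded.
print(quote(word.encode("utf-8")))  # %E5%A4%89%E4%BD%93

# unquote() reverses the transformation.
print(unquote(encoded) == word)  # True
```

In Python 3 there is no need to call `.encode('utf-8')` first, although passing the bytes works too.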
kojiro