As part of a scraper, I need to encode kanji for use in URLs, but I can't get the expected output even for a single character, and I've lost track of everything I've tried so far from various Stack Overflow posts.

The document is set to UTF-8.

# -*- coding: utf-8 -*-
import urllib2

sampleText = u'ル'

print sampleText
print sampleText.encode('utf-8')
print urllib2.quote(sampleText.encode('utf-8'))

It gives me the values:

ル
ル
%E3%83%AB

But as far as I understand, it should give me:

ル
XX
%83%8B

What am I doing wrong? Are there some settings I don't have correct? Because as far as I understand it, my output from the encode() should not be ル.

Peter Mortensen
NegatioN

1 Answer

The code you show works correctly. The character is KATAKANA LETTER RU, Unicode code point U+30EB. Encoded as UTF-8, it becomes the Python bytestring '\xe3\x83\xab', which prints as ル on a UTF-8 console (on a Latin-1 console the same bytes would show up as mojibake such as ã«). When you URL-escape those three bytes, you get %E3%83%AB.
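To check those values yourself, here is a minimal sketch. It is written for Python 3, where `quote` lives in `urllib.parse` (in Python 2 the same function is `urllib.quote`); the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
from urllib.parse import quote  # Python 3; Python 2 has urllib.quote instead

sample_text = u'ル'  # KATAKANA LETTER RU

# The code point of the character, as a hex string.
code_point = hex(ord(sample_text))

# Encode the text to UTF-8 bytes, then percent-encode those bytes.
utf8_bytes = sample_text.encode('utf-8')
utf8_quoted = quote(utf8_bytes)

print(code_point)        # 0x30eb
print(repr(utf8_bytes))  # b'\xe3\x83\xab'
print(utf8_quoted)       # %E3%83%AB
```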

The value you seem to be expecting, %83%8B, is the Shift-JIS encoding of ル, rather than the UTF-8 encoding. For a long time there was no standard for how to encode non-ASCII text in a URL, and as this Wikipedia section notes, many programs simply assumed a particular encoding (often without specifying it). The newer standard of Internationalized Resource Identifiers (IRIs), however, says that you should always convert Unicode text to UTF-8 bytes before performing percent-encoding.

So, if you're generating your encoded string for a new program that wants to meet the current standards, stick with the UTF-8 value you're getting now. I would only use the Shift-JIS version if you need it for backwards compatibility with specific old websites or other software that expects that the data you send will have that encoding. If you have any influence over the server (or other program), see if you can update it to use IRIs too!
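If you do end up needing both forms, something like the following sketch might help. `percent_encode` is a hypothetical helper, not part of any library; it assumes Python 3 and simply makes the byte encoding an explicit parameter that defaults to UTF-8.

```python
# -*- coding: utf-8 -*-
from urllib.parse import quote  # Python 3; Python 2 has urllib.quote instead


def percent_encode(text, encoding='utf-8'):
    """Hypothetical helper: encode text to bytes, then percent-encode them.

    UTF-8 is the default, as recommended for IRIs; pass another codec
    only for legacy targets that are known to expect it.
    """
    return quote(text.encode(encoding))


sample_text = u'ル'

# Current standard: UTF-8 bytes, then percent-encoding.
print(percent_encode(sample_text))               # %E3%83%AB

# Legacy form for a server that expects Shift-JIS.
print(percent_encode(sample_text, 'shift_jis'))  # %83%8B
```

Keeping the encoding as an argument means the Shift-JIS case stays an explicit, per-call exception rather than a global setting.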

Community
Blckknght