
What is the right way to decode a URL parameter that contains Unicode characters and was escaped on the client side with JavaScript's escape(text)? For example, if my URL is: domain.com/?text=%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4

I tried text = urllib.unquote(request.GET.get('text')), but I got the exact same string back (%u05D0%u05D9%u05DA%20%u05DE ... ).

Shay
    Possible duplicate of [How to unquote a urlencoded unicode string in python?](http://stackoverflow.com/questions/300445/how-to-unquote-a-urlencoded-unicode-string-in-python). Short answer: the `%uXXXX` encoding scheme is non-standard, you'll probably have to write your own decoder. – Frédéric Hamidi Dec 22 '10 at 19:51

2 Answers


Eventually I changed the client side from escape(text) to encodeURIComponent(text), and then on the Python side used:

request.encoding = 'UTF-8'
text = unicode(request.GET.get('text', None))

I'm not sure this is the best approach, but it works for both English and Hebrew.
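For reference, here is a minimal sketch of why the encodeURIComponent() route needs no custom decoder. It assumes Python 3, where urllib.unquote lives at urllib.parse.unquote; the sample string is my own, hand-encoded the way encodeURIComponent() would encode the first two words of the text, and is not taken from a real request.

```python
from urllib.parse import parse_qs, unquote

# Assumed sample: UTF-8 bytes, percent-encoded, as encodeURIComponent()
# would produce for the first two words of the Hebrew text.
encoded = "%D7%90%D7%99%D7%9A%20%D7%9E%D7%9E%D7%99%D7%A8%D7%99%D7%9D"

# unquote() decodes %XX sequences as UTF-8 by default.
decoded = unquote(encoded)

# parse_qs() applies the same decoding to a whole query string.
query = parse_qs("text=" + encoded)

print(decoded)
print(query["text"][0] == decoded)
```

A web framework normally does this decoding itself when it builds the request's query dictionary, so the explicit call is only needed when handling raw query strings.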

Shay
    Yes, `encodeURIComponent()` is the correct function to URL-encode a string; `escape()` is some weirdo custom JavaScript-specific encoding that looks a bit like URL-encoding but isn't at all. – bobince Dec 22 '10 at 20:18

Because %uXXXX is not a standard URL encoding, and Python's unicode-escape codec expects \uXXXX instead, you need a small transform: replace '%u' with '\u', then unquote the remaining %XX escapes such as %20. Replace only '%u', not every '%': a blanket replacement would turn %20 into the octal escape \20 (the control character \x10) instead of a space. For example, in Python 2:

>>> import urllib
>>> text = '%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4'
>>> text = urllib.unquote(text.replace('%u', '\\u'))
>>> text_u = text.decode('unicode-escape')
>>> print text_u
איך ממירים את הטקסט הזה

Once you have a unicode object, you can encode it to whatever encoding you like:

>>> text_utf8 = text_u.encode('utf8')
>>> text_utf8
'\xd7\x90\xd7\x99\xd7\x9a \xd7\x9e\xd7\x9e\xd7\x99\xd7\xa8\xd7\x99\xd7\x9d \xd7\x90\xd7\xaa \xd7\x94\xd7\x98\xd7\xa7\xd7\xa1\xd7\x98 \xd7\x94\xd7\x96\xd7\x94'
>>> print text_utf8
איך ממירים את הטקסט הזה
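If you would rather not lean on the unicode-escape codec (which also interprets any literal backslash in the input), the %uXXXX decoding can be written as a small regex-based function. This is only a sketch, assuming Python 3; js_unescape is a hypothetical helper name, not a standard library function. It treats %uXXXX as a UTF-16 code unit and %XX as a Latin-1 code point, which is how JavaScript's escape() encodes characters.

```python
import re

# %uXXXX (UTF-16 code unit) or %XX (Latin-1 code point), as emitted by escape().
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(s):
    """Decode a string produced by JavaScript's legacy escape() (hypothetical helper)."""
    def repl(match):
        # Exactly one of the two groups is set for any match.
        return chr(int(match.group(1) or match.group(2), 16))

    decoded = _ESCAPE_RE.sub(repl, s)
    # escape() writes characters outside the BMP as two %uXXXX surrogates;
    # a round trip through UTF-16 joins them into a single code point.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


text = ("%u05D0%u05D9%u05DA%20%u05DE%u05DE%u05D9%u05E8%u05D9%u05DD%20"
        "%u05D0%u05EA%20%u05D4%u05D8%u05E7%u05E1%u05D8%20%u05D4%u05D6%u05D4")
print(js_unescape(text))
```

Sequences that do not match either pattern, such as a stray '%', are left as they are.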
AngelIW