
I am retrieving Japanese and Chinese text from a website as JSON using urllib2 and converting it to HTML entities using encode('ascii', 'xmlcharrefreplace').

I then use curl to post the same content (after making minor changes) back to the website using percent encoding. My code works fine for English text with special characters, but I need to convert all Japanese/Chinese characters from HTML entities to percent encoding.

Is there a function in Python which could do this magic?

PS: For English text, I have my own function to convert special chars to percent encoding. I cannot use this method for the Japanese/Chinese characters as there are too many of them.

koolkat
  • Seems that these two posts should get you running: http://stackoverflow.com/questions/275174/how-do-i-perform-html-decoding-encoding-using-python-django and http://stackoverflow.com/questions/1695183/how-to-percent-encode-url-parameters-in-python – matiasg Mar 05 '15 at 18:28
  • I am looking to convert html entities to percent encoding. The posts you mentioned have suggestions only to convert special chars to percent encoding. – koolkat Mar 05 '15 at 19:06
  • Could you provide an example of the text you have and the text you need? – matiasg Mar 05 '15 at 19:27
  • For example, I have: 命令で. I need this in the following format (the encoding is not right, this is just an example): %2D%EF%4E – koolkat Mar 05 '15 at 22:58
  • So this is what I had in mind. In the first post, there is an answer with HTMLParser().unescape() (for Python2). Then you need to encode to utf8, or the encoding you prefer, and then urllib.quote (again, Python2), as @User shows. – matiasg Mar 06 '15 at 12:07

1 Answer


You want to combine two things:

  1. HTML decoding
  2. URL encoding

Here is an example (Python 3):

>>> import html
>>> html.unescape('&#123;')
'{'
>>> import urllib.parse
>>> urllib.parse.quote('{')
'%7B'
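The two steps can be chained for the Japanese/Chinese case. What follows is only a minimal sketch, assuming Python 3, that the input consists of numeric character references such as those produced by xmlcharrefreplace, and that the target site expects UTF-8 percent encoding. The helper name entities_to_percent and the sample string are made up for illustration.

```python
import html
import urllib.parse


def entities_to_percent(text, encoding="utf-8"):
    """Hypothetical helper: decode HTML character references,
    then percent-encode the resulting text."""
    # Step 1: HTML decoding, e.g. '&#21629;' -> '命'
    decoded = html.unescape(text)
    # Step 2: URL encoding; safe="" also escapes '/', which quote() keeps by default
    return urllib.parse.quote(decoded, safe="", encoding=encoding)


# Assumed sample input: the character references for 命令で
sample = "&#21629;&#20196;&#12391;"
print(entities_to_percent(sample))  # %E5%91%BD%E4%BB%A4%E3%81%A7
```

If the server expects an encoding other than UTF-8, pass it as the encoding argument. On Python 2, the equivalents mentioned in the comments are HTMLParser().unescape() followed by .encode('utf8') and urllib.quote().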
User