
I have a filename that contains %ed%a1%85%ed%b7%97.svg and want to decode it to its proper string representation in Python 3. I know the result should be 𡗗.svg, but the following code does not work:

import urllib.parse
import codecs

input = '%ed%a1%85%ed%b7%97.svg'
unescaped = urllib.parse.unquote(input)
raw_bytes = bytes(unescaped, "utf-8")
decoded = codecs.escape_decode(raw_bytes)[0].decode("utf-8")
print(decoded)

This prints ������.svg. It does work, however, when the input is a string like %e8%b7%af.svg, which correctly decodes to 路.svg.

I've tried to decode this with online tools such as https://mothereff.in/utf-8 by replacing % with \x, leading to \xed\xa1\x85\xed\xb7\x97.svg. The tool correctly decoded this input to 𡗗.svg.

What happens here?

Mark Gaensicke
  • unicode character in url encoded (percent encoding) format https://stackoverflow.com/a/2742985/10254804 – deadvoid Oct 21 '18 at 09:05
  • yes, except that it is not decodable in python? – Mark Gaensicke Oct 21 '18 at 09:18
  • which online tool you use to output `\xed\xa1\x85\xed\xb7\x97` to '𡗗'? that mothereff.in link outputs `\x25\x65\x64\x25\x61\x31\x25\x38\x35\x25\x65\x64\x25\x62\x37\x25\x39\x37`, btw – deadvoid Oct 22 '18 at 10:22
  • when using the mothereff.in tool to _decode_ `\xed\xa1\x85\xed\xb7\x97.svg` then it properly shows `𡗗.svg`. Encoding it again shows `\xF0\xA1\x97\x97\x2E\x73\x76\x67` of which `\xF0\xA1\x97\x97` is the character, `\x2E` is the period `.` and `\x73\x76\x67` is `svg` – Mark Gaensicke Oct 22 '18 at 12:07

1 Answer


You need the correct encoding, plus a console/terminal that supports and is configured for UTF-8, to display the correct characters.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
PEP 263 -- Defining Python Source Code Encodings: https://www.python.org/dev/peps/pep-0263/
https://stackoverflow.com/questions/3883573/encoding-error-in-python-with-chinese-characters#3888653
"""
from urllib.parse import unquote

urlencoded = '%ed%a1%85%ed%b7%97'

char = unquote(urlencoded, encoding='gbk')
char1 = unquote(urlencoded, encoding='big5_hkscs')
char2 = unquote(urlencoded, encoding='gb18030')

print(char)
print(char1)
print(char2)

# 怼呿窏
# 瞴�窾�
# 怼呿窏

This is quite an exotic Unicode character, and I was wrong about the encoding: it's not a simplified Chinese character but a traditional one, and it sits quite far out in the mapping as well, at U+215D7 in CJK Unified Ideographs Extension B.
The code point listed and the other values made me suspicious that this was poorly encoded, so it took me a while.
Someone helped me figure out how the encoding got into that form. You need to do a few encoding transforms to revert it to its original value:

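from urllib.parse import unquote_to_bytes

# decode the UTF-8-encoded surrogates, then round-trip through UTF-16 to join the pair
cjk = unquote_to_bytes(urlencoded).decode('utf-8', 'surrogatepass').encode('utf-16', 'surrogatepass').decode('utf-16')
print(cjk)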
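
For the full filename from the question, the same transforms can be wrapped in a small helper. This is only a sketch: `unquote_cesu8` is a made-up name, and it assumes the input really is in this CESU-8-like form, where each three-byte group (`ed a1 85` and `ed b7 97`) is one half of a UTF-16 surrogate pair (U+D845 and U+DDD7) that was encoded as UTF-8 on its own. Plain `unquote` rejects those halves as invalid UTF-8 and substitutes replacement characters, which would explain the output in the question.

```python
from urllib.parse import unquote_to_bytes


def unquote_cesu8(quoted: str) -> str:
    """Percent-decode text whose astral characters were stored as
    UTF-8-encoded surrogate halves (hypothetical helper, not a stdlib API)."""
    raw = unquote_to_bytes(quoted)
    # 'surrogatepass' makes the UTF-8 codec accept the encoded surrogate halves
    with_surrogates = raw.decode('utf-8', 'surrogatepass')
    # Round-tripping through UTF-16 joins the two halves into one code point
    return with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')


name = unquote_cesu8('%ed%a1%85%ed%b7%97.svg')
print(name)                            # 𡗗.svg
print(hex(ord(name[0])))               # 0x215d7
print(unquote_cesu8('%e8%b7%af.svg'))  # 路.svg, ordinary UTF-8 passes through unchanged
```

The final `decode('utf-16')` is strict, so an input with an unpaired surrogate half should raise `UnicodeDecodeError` rather than produce garbage.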
deadvoid
  • I'm sorry but could you please read my question again? My terminal is set to UTF-8 properly. Also the desired output is none of the ones you gave as example. – Mark Gaensicke Oct 22 '18 at 04:02
  • hmm maybe i got the wrong encoding, i thought those three are all that's available for chinese char encoding... sorry, that comment about terminal encoding wasn't to imply on your settings, i included it for the completeness sake, to complement the link to discussion thread i put in there. i need to dig up my bookmark to find the online converter link i saved a long time ago, i'll edit my answer later. – deadvoid Oct 22 '18 at 04:42