I am trying to take a user's input as octal-escaped UTF-8 bytes and convert it to normal UTF-8 characters. The input is taken from an entry field (field) in tkinter, and this is how I am processing it:

lines = self.field.get('1.0', END).split('\n')
print(bytes(lines[0], 'utf-8').decode('unicode_escape'))

For the example character \350\260\242, this prints "è ° ¢" when it should print 谢.

b'\350\260\242'.decode('utf-8')

returns the correct character, but this is useless because I am trying to take a user's input. Is there any way to take a user's input directly as bytes, or is there a better way to do my decoding? Any help is appreciated.

User9123
  • http://stackoverflow.com/questions/14820429/how-do-i-decodestring-escape-in-python3 – Josh Lee Jan 26 '17 at 22:53
  • "for the example character \350\260\242 this prints "è ° ¢" when it should print 谢." I cannot reproduce this; I see `谢` output. This problem is caused by your terminal. I ran into issues like this before, too. They seem to have been Windows-specific, and resolved in more recent versions. – Karl Knechtel Aug 05 '22 at 03:07

1 Answer

Yeah, unicode_escape is a bit weird in that it converts from a bytestring of escape sequences to a Unicode string (which makes sense, since that's what it's for). You could use the "round-trip through latin-1 mojibake" trick:

>>> br'\350\260\252'.decode('unicode_escape')
'è°ª'
>>> _.encode('l1').decode('u8')
'谪'

(This works because latin-1 maps each of the first 256 code points to the byte with the same value, so encoding the mojibake as latin-1 recovers the original bytes.)
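To apply the same trick to text that arrives as a str (for example, whatever a tkinter widget returns), something like the following sketch might do. The function name decode_escaped_utf8 is made up for illustration, and the sketch assumes the raw input contains only characters below code point 256; anything else would make the first encode step fail.

```python
def decode_escaped_utf8(text: str) -> str:
    """Turn a str such as r'\350\260\242' into the character it encodes.

    Hypothetical helper: assumes every character typed by the user
    is below code point 256, so latin-1 can encode it.
    """
    # str -> bytes, so that unicode_escape can interpret the backslashes.
    raw = text.encode("latin-1")
    # Each escape becomes one code point in 0..255 (the "mojibake" string).
    mojibake = raw.decode("unicode_escape")
    # latin-1 maps those code points back to identical byte values,
    # which can then be decoded as real UTF-8.
    return mojibake.encode("latin-1").decode("utf-8")
```

Calling decode_escaped_utf8(r"\350\260\242") should give 谢. Note that unicode_escape also interprets other Python escapes such as \n or \x41, so this is not limited to octal sequences.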

And there's also the undocumented codecs.escape_decode:

>>> import codecs
>>> codecs.escape_decode(br'\350\260\252')[0].decode()
'谪'

Naturally, both of these codecs are tailored to Python syntax in particular, so they also interpret other escapes such as \n and \x41. If you want to handle only octal escapes, you'll have to roll your own.
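One way to roll your own is a small regex-based decoder. This is only a sketch under some assumptions: the name decode_octal_utf8 is hypothetical, an escape is taken to be a backslash followed by one to three octal digits, and everything else (including sequences like \n) is passed through unchanged.

```python
import re

# A backslash followed by one to three octal digits, e.g. \350.
_OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_octal_utf8(text: str) -> str:
    """Decode only octal escapes in text, treating them as UTF-8 bytes."""
    out = bytearray()
    pos = 0
    for match in _OCTAL_ESCAPE.finditer(text):
        # Literal text between escapes is kept, encoded as UTF-8.
        out += text[pos:match.start()].encode("utf-8")
        value = int(match.group(1), 8)
        if value > 0xFF:
            # Something like \777 does not fit in a single byte.
            raise ValueError(f"octal escape out of byte range: {match.group(0)}")
        out.append(value)
        pos = match.end()
    # Trailing literal text after the last escape.
    out += text[pos:].encode("utf-8")
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return out.decode("utf-8")
```

With this, decode_octal_utf8(r"\350\260\242") should give 谢, while a string like r"\n" comes back as a literal backslash followed by n. Building a bytearray first means escapes can be mixed freely with ordinary text, including non-ASCII characters.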

Josh Lee