How to properly decode a utf-8 string with mixed-in octal escapes?

Asked Jun 18 '20 at 17:00

Active Jul 04 '20 at 20:05

Viewed 346 times

When parsing /proc/self/mountinfo on Linux, some fields of each line (one line per mount) may well contain UTF-8 encoded characters. Since the mountinfo line format separates fields with spaces, mountinfo escapes at least ' ' (space) and '\' (backslash) as "\040" and "\134" (literally!). How can I convert an escaped field value back into a non-escaped string? For example, the path "/tmp/a\ " appears in mountinfo as /tmp/a\134\040, which is the Python string '/tmp/a\\134\\040'.

Is there a better way than the following rather involved one (from https://stackoverflow.com/a/26311382)? That is, one with fewer chained encode/decode steps?

>>> s='/tmp/a\\134\\040'
>>> s.encode().decode('unicode-escape').encode('latin-1').decode('utf-8')
'/tmp/a\\ '

PS: Don't ask why anyone sane would use such path names; this is just for illustrative purposes ;)
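
For comparison with the encode/decode chain shown in the question, here is a minimal sketch of a different approach: replace only the three-digit octal escapes at the byte level, then decode the result as UTF-8 once. The helper name `unescape_mountinfo_field` and the pattern name `_OCTAL_ESCAPE` are hypothetical, and the sketch assumes the field contains only `\ooo`-style octal escapes (as in the `\040` and `\134` examples above) and otherwise valid UTF-8.

```python
import re

# A backslash followed by exactly three octal digits whose value fits in
# one byte (000 to 377), e.g. b"\\040" for a space.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def unescape_mountinfo_field(field: str) -> str:
    """Hypothetical helper: undo octal escapes in one mountinfo field.

    Assumes the field is valid UTF-8 apart from the octal escapes.
    """
    raw = field.encode("utf-8")
    # Replace each escape with the single byte it stands for. re.sub makes
    # one pass, so the backslash produced by "\134" is not re-interpreted.
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


print(repr(unescape_mountinfo_field('/tmp/a\\134\\040')))
```

Unlike `unicode-escape`, this leaves sequences such as `\n` or `\x41` alone, which may or may not be what you want. It does use a regex, so whether it counts as "better" than the chain is a matter of taste.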

edited Jul 04 '20 at 20:05

TheDiveO

I don't think you can do better than that. I say be happy you don't have to use regex! – lenz Jun 18 '20 at 17:47
If it helps, you can omit the last argument `'utf8'` to the final `.decode()`, like you did for the first `.encode()`. – lenz Jun 18 '20 at 17:48

0 Answers