Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'

Question

I am reading path strings from a file. The file contains strings like this one, in which special Unicode characters are escaped: /WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4

I need to convert that string of characters into this: /WAY-ALPHA2019-Español-Episodio-01.mp4

Here is some code demonstrating what I am trying to do:

>>> stringa = r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'
>>> stringb = b'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'

>>> print(stringa)
/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4
>>> print(stringb)
b'/WAY-ALPHA2019-Espan\xcc\x83ol-Episodio-01.mp4'

>>> print(stringa.decode('utf8'))
Traceback (most recent call last):
  File "C:\Users\arlin\AppData\Local\Programs\Python\Python310-32\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

>>> print(stringb.decode('utf8'))
/WAY-ALPHA2019-Español-Episodio-01.mp4

`decode` is a method that applies to byte strings, not Unicode strings. For example, the Unicode string 'Español' is converted to a UTF-8 byte string with `encoded = 'Español'.encode('utf-8')`, and the resulting byte string is converted back to a Unicode string by calling `s = encoded.decode('utf-8')`. — Booboo, Jan 15 '22 at 20:43
The "r'string'" terminology here is confusing if you are *actually* asking how to read lines from a text file. — tripleee, Jan 16 '22 at 10:10

Jasmijn · Accepted Answer · 2022-01-16T07:35:09.010

1

Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three digit code (match[1]), interpreting that as a number written in octal (int(_, 8)), and then replacing the original escape sequence with a single byte (bytes([_])).

We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after we have unescaped those sequences can we decode the UTF-8 bytes to a string.
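The steps above can be wrapped in a small function. This is only a sketch: the name `unescape_octal` is made up for illustration, and it assumes the input contains nothing but ASCII characters and three-digit octal escapes.

```python
import re

# A literal backslash followed by exactly three octal digits.
OCTAL_ESCAPE = re.compile(rb'\\([0-7]{3})')

def unescape_octal(text):
    # Hypothetical helper, not part of any library.
    # Step 1: work on bytes (assumes the input is pure ASCII).
    raw = text.encode('ascii')
    # Step 2: replace each escape with the single byte it stands for.
    unescaped = OCTAL_ESCAPE.sub(lambda m: bytes([int(m[1], 8)]), raw)
    # Step 3: only now interpret the bytes as UTF-8.
    return unescaped.decode('utf-8')

path = unescape_octal(r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4')
print(path)
```

Two things to be aware of. The escaped bytes \314\203 are the UTF-8 encoding of U+0303 (combining tilde), so the result is "n" followed by a combining character; `unicodedata.normalize('NFC', path)` should give the single precomposed ñ if you need that form. Also, the pattern accepts escapes above \377, which do not fit in one byte and would make `bytes([...])` raise a ValueError.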

edited Jan 16 '22 at 07:35

answered Jan 15 '22 at 20:52


what is `x` above in `int(x[match], 8)`? – Arlin Sandbulte Jan 15 '22 at 23:16
I'm trying to use the suggested solution, but I cannot figure it out. Here is what I am trying: `new_str = re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(x[match], 8)]), old_str.encode('ascii')).decode('utf-8')` – Arlin Sandbulte Jan 15 '22 at 23:47
I got it! ```new_string = re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), old_string.encode('ascii')).decode('utf-8')``` – Arlin Sandbulte Jan 16 '22 at 01:46
Ah yes I made a copy/paste error, apologies, but you got there on your own! – Jasmijn Jan 16 '22 at 07:38

Arlin Sandbulte · Answer 2 · 2022-01-16T02:13:34.867

0

I figured it out.
The code from @Jasmijn had a bug/typo. Here is the working code:
UPDATED: In my case, old_string could already contain non-ASCII characters, so I had to change .encode('ascii') to .encode('utf-8'), which still works for me.

import re
new_string = re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), old_string.encode('utf-8')).decode('utf-8')
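A small check of why the switch to .encode('utf-8') matters, as a sketch: the sample string `mixed` and the function name `unescape_octal_utf8` are made up for illustration. Encoding as UTF-8 should be safe here because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so such bytes can never look like a backslash or an octal digit to the regex.

```python
import re

def unescape_octal_utf8(text):
    # Hypothetical helper: same substitution, but tolerant of non-ASCII input.
    raw = text.encode('utf-8')
    return re.sub(rb'\\([0-7]{3})', lambda m: bytes([int(m[1], 8)]), raw).decode('utf-8')

# Made-up input: one real non-ASCII character (ñ) plus one octal escape pair.
mixed = 'A\u00f1o-Espan\\314\\203ol.mp4'
result = unescape_octal_utf8(mixed)
print(result)

# The ASCII variant cannot even encode this input.
try:
    mixed.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```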

edited Jan 16 '22 at 02:13

answered Jan 16 '22 at 01:50

