
I am reading path strings from a file. The file contains strings like this one, in which non-ASCII characters appear as backslash-escaped octal byte values: /WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4

I need to convert that string of characters into this: /WAY-ALPHA2019-Español-Episodio-01.mp4

Here is some code demonstrating what I am trying to do:

>>> stringa = r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'
>>> stringb = b'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'

>>> print(stringa)
/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4
>>> print(stringb)
b'/WAY-ALPHA2019-Espan\xcc\x83ol-Episodio-01.mp4'

>>> print(stringa.decode('utf8'))
Traceback (most recent call last):
  File "C:\Users\arlin\AppData\Local\Programs\Python\Python310-32\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

>>> print(stringb.decode('utf8'))
/WAY-ALPHA2019-Español-Episodio-01.mp4
  • `decode` is a method that applies to byte strings, not Unicode strings. For example, the Unicode string 'Español' is converted to a UTF-8 byte string with `encoded = 'Español'.encode('utf-8')`, and the resulting byte string is converted back to a Unicode string by calling `s = encoded.decode('utf-8')`. – Booboo Jan 15 '22 at 20:43
  • The "r'string'" terminology here is confusing if you are *actually* asking how to read lines from a text file. – tripleee Jan 16 '22 at 10:10

2 Answers


Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})', which matches a literal backslash \ followed by exactly 3 octal digits. Each occurrence is replaced by taking the three-digit code (match[1]), interpreting it as a number written in octal (int(_, 8)), and substituting a single byte with that value (bytes([_])) for the original escape sequence.

We need to operate on bytes because the escape codes represent raw bytes, not Unicode characters. Only after we have unescaped those sequences can we decode the UTF-8 bytes to a string.
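The steps above can be sketched as a small function. The names `OCTAL_ESCAPE` and `unescape_octal` are made up for illustration; the body is the same substitution as the one-liner, split into its three stages and applied to the string from the question.

```python
import re

# Hypothetical names; the logic is the one-liner above, split into steps.
OCTAL_ESCAPE = re.compile(rb'\\([0-7]{3})')

def unescape_octal(text: str) -> str:
    # 1. Work on bytes: the escapes stand for raw bytes, not characters.
    raw = text.encode('ascii')
    # 2. Replace each \ooo escape with the single byte it encodes.
    unescaped = OCTAL_ESCAPE.sub(lambda m: bytes([int(m[1], 8)]), raw)
    # 3. Only now interpret the byte sequence as UTF-8.
    return unescaped.decode('utf-8')

escaped = r'/WAY-ALPHA2019-Espan\314\203ol-Episodio-01.mp4'
result = unescape_octal(escaped)
print(result)
```

Note that \314\203 is the UTF-8 encoding of U+0303 (combining tilde), so the result contains "n" followed by a combining character rather than a single precomposed "ñ". If the precomposed form is needed, `unicodedata.normalize('NFC', result)` should produce it. An escape above \377 would not fit in one byte and would make `bytes([...])` raise a ValueError; this sketch assumes the input never contains one.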

Jasmijn

I figured it out.
The code from @Jasmijn had a bug/typo.
UPDATE: In my case, old_string could already contain non-ASCII characters, so I had to change .encode('ascii') to .encode('utf-8'), which still works for me.
Here is the working code:

import re
new_string = re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), old_string.encode('utf-8')).decode('utf-8')
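As a rough illustration of why the codec matters, the sketch below uses a made-up input that mixes a real non-ASCII character with octal escapes. Encoding it as ASCII fails, while encoding it as UTF-8 leaves the existing character intact and still lets the substitution handle the escapes.

```python
import re

# Hypothetical input: a literal 'é' plus octal escapes for a combining tilde.
old_string = 'Caf\u00e9-Espan\\314\\203ol.mp4'

# .encode('ascii') cannot represent 'é', so it raises UnicodeEncodeError.
try:
    old_string.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# .encode('utf-8') turns 'é' into two bytes that the pattern never matches,
# so they pass through untouched and decode back to 'é' at the end.
new_string = re.sub(
    rb'\\([0-7]{3})',
    lambda match: bytes([int(match[1], 8)]),
    old_string.encode('utf-8'),
).decode('utf-8')

print(ascii_ok)
print(new_string)
```

This works because UTF-8 encodes every non-ASCII character using only bytes with the high bit set, so none of them can be mistaken for a backslash or an octal digit.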