0

A string variable sometimes contains octal escape sequences that need to be decoded. Example: given oct_var = "String\302\240with\302\240octals", the resulting value should be "String with octals", with non-breaking spaces between the words.

Codecs doesn't support octal, and I failed to find a working solution with encode(). The strings originate upstream outside my control.

Python 3.9.8

Edited to add: It doesn't have to scale or be ultra fast, so maybe the idea from here (#6) can work (not tested yet):

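import re

def decode(encoded):
    # Replace each literal \NNN octal escape with the character for that byte value.
    for octc in set(re.findall(r'\\([0-3][0-7]{2})', encoded)):
        encoded = encoded.replace('\\' + octc, chr(int(octc, 8)))
    # Python 3 str has no .decode(): map the characters back to bytes, then decode as UTF-8.
    return encoded.encode('latin-1').decode('utf-8')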
  • Does this answer your question? [Convert octal representation of UTF-8](https://stackoverflow.com/questions/50621340/convert-octal-representation-of-utf-8) – Pablo Díaz Dec 23 '21 at 04:47
  • I have the same problem as the OP there -- "i cant write b'\320...\271' cuz i get the octal values as a string object dynamically". – calisprontix Dec 23 '21 at 05:17
  • One link from there (thx) this looks like a solution: https://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python/24519338#24519338, scroll to "Adding a regular expression to solve the problem" and below. – calisprontix Dec 23 '21 at 05:47
  • see updated answer, given the info that "the strings originate upstream outside my control". – Pierre D Dec 23 '21 at 14:38

2 Answers

2

You forgot to indicate that oct_var should be given as bytes:

>>> oct_var = b"String\302\240with\302\240octals"
>>> oct_var.decode()
'String\xa0with\xa0octals'
>>> print(oct_var.decode())
String with octals

Note: if your value is already a string (and that is beyond your control), you can try to convert it to bytes:

>>> oct_str = "String\302\240with\302\240octals"  # as a string
>>> oct_var = bytes([ord(c) for c in oct_str])
# often equivalent to:
>>> oct_var = oct_str.encode('Latin1')

and then proceed as above.

Note: if the string also contains characters beyond ASCII (e.g., Latin-1 accented characters like 'é'), the subsequent .decode() will generally fail, because UTF-8 represents those as multibyte sequences (e.g. 'é'.encode() == b'\xc3\xa9', but 'é'.encode('Latin1') == b'\xe9'). If the string contains Unicode characters beyond Latin-1 (e.g. '你好'), you will get a ValueError or a UnicodeEncodeError, depending on which of the two conversion methods you choose.

In short: don't fly anything expensive, heavy, or with people inside with that -- this is hacky. At the very least, surround your code with try ... except (ValueError, UnicodeEncodeError, UnicodeDecodeError) and handle these exceptions accordingly.
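To make the try ... except advice concrete, here is a minimal sketch. The helper name latin1_to_utf8 and the choice to return the input unchanged on failure are assumptions for illustration, not part of the answer above.

```python
def latin1_to_utf8(text):
    """Reinterpret a str whose code points are really UTF-8 byte values.

    Hypothetical helper: returns the repaired string, or the input
    unchanged when the round trip is not possible.
    """
    try:
        # Fails with UnicodeEncodeError if any character is above U+00FF.
        raw = text.encode("latin-1")
        # Fails with UnicodeDecodeError if the bytes are not valid UTF-8.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Both are subclasses of ValueError, so catching ValueError
        # alone would also cover the bytes([ord(c) ...]) variant.
        return text


print(repr(latin1_to_utf8("String\302\240with\302\240octals")))
```

Returning the original string keeps the caller running on bad input; logging or re-raising may be a better fit, depending on how much you trust the upstream data.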

Pierre D
  • 1
    This touches an important point. Those codes (`\302\240`) are the UTF-8 representation of the non-breaking space, U+00A0. If those octal codes are present in a Unicode string, you have a problem. You really need to have that as a bytes string, in which case the conversion is easy, as Pierre has correctly shown. – Tim Roberts Dec 23 '21 at 04:46
  • We are stuck with • solutions that will not vanish suddenly but are not reliable, and • something that is stable enough for the Python devs[1] but can be gone any time without notice. `codecs.escape_decode` could go inside a `try ... except` that fires some sort of notification when it is called but no longer provided in Python. Maybe by then we have a replacement. [1] "…we can not just remove it while it is used in the pickle module, and there is no reason to change it as it works pretty good for its purpose…" – calisprontix Dec 23 '21 at 16:18
0

Putting your ideas and pointers together, and accepting the risks that come with using an undocumented function[*], i.e., codecs.escape_decode, this line works:

value = (codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8"))

[*] "Internal function means: you can use it on your risk but the function can be changed or even removed in any Python release."
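Since the function may disappear, one option is to guard the call so that its absence is noticed instead of crashing. This is only a sketch: the wrapper name decode_octal_escapes and the fallback of returning the input unchanged are assumptions, and it relies on codecs.escape_decode being present, which is true for current CPython but not guaranteed.

```python
import codecs
import warnings


def decode_octal_escapes(text):
    """Hypothetical wrapper around the undocumented codecs.escape_decode."""
    escape_decode = getattr(codecs, "escape_decode", None)
    if escape_decode is None:
        # The internal function is gone: notify and leave the input untouched.
        warnings.warn("codecs.escape_decode is unavailable; input returned unchanged")
        return text
    # escape_decode takes bytes and returns (decoded_bytes, consumed_length).
    raw, _consumed = escape_decode(text.encode("latin-1"))
    return raw.decode("utf-8")


# A raw string is used here so the backslashes stay literal, as in upstream data.
print(repr(decode_octal_escapes(r"String\302\240with\302\240octals")))
```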

Explanations for codecs.escape_decode:

https://stackoverflow.com/a/37059682/5309571

Examples for its use:

https://www.programcreek.com/python/example/8498/codecs.escape_decode

Other approaches that may turn out to be more future-proof than codecs.escape_decode (no warranty, I have not tried them):

https://stackoverflow.com/a/58829514/5309571

https://bytes.com/topic/python/answers/743965-converting-octal-escaped-utf-8-a
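For comparison, here is a sketch that uses only documented standard-library calls (re and str/bytes methods). It is not taken from the links above. The function name unescape_octal_utf8 is hypothetical, and it assumes the escapes are always three octal digits in the range \000 to \377; other escapes such as \n are left untouched.

```python
import re

# A literal backslash followed by three octal digits with a value of 0..255.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def unescape_octal_utf8(text):
    """Hypothetical helper: turn literal \\NNN escapes into UTF-8 bytes, then decode."""
    # Encode as UTF-8 first so real non-ASCII characters in the input survive.
    raw = text.encode("utf-8")
    # Replace each escape with the single byte it stands for.
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")


print(repr(unescape_octal_utf8(r"String\302\240with\302\240octals")))
```

Working on bytes rather than on the str avoids the Latin-1 round trip, so input that mixes real Unicode text with octal escapes should still decode. A UnicodeDecodeError is still possible if the escapes do not form valid UTF-8.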