There are many questions about UTF-8 to Unicode conversion, but I still haven't found an answer to my problem.

Let's take a string like this:

a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

Python 3.6 sees this string as Je-li pro za\xc5\x99azov\xc3\xa1n\xc3\xad, with literal backslashes. I need to convert this UTF-8-like string to its Unicode representation. The final result should be Je-li pro zařazování.

With a.decode("utf-8") I get AttributeError: 'str' object has no attribute 'decode', because Python considers the object already decoded.

If I convert it to bytes first with bytes(a, "utf-8"), the backslashes are only doubled, and .decode("utf-8") gives me back my original a.

How can I get the Unicode string Je-li pro zařazování from this a?

Karl Knechtel
Marek L.
  • Doesn't [this](https://stackoverflow.com/questions/1864701/convert-utf-8-octets-to-unicode-code-points) help? (and before you say "no it's not" it doesn't use `bytes(a,"utf-8")`, you need better explanation.) – user202729 Apr 10 '18 at 14:24
  • And... why do you have two ``\``s? – user202729 Apr 10 '18 at 14:25
  • [how-do-i-un-escape-a-backslash-escaped-string-in-python](https://stackoverflow.com/questions/1885181) – user202729 Apr 10 '18 at 14:26
  • Why two backslashes... It is result of one strange API, that returns some characters decoded and some not. – Marek L. Apr 10 '18 at 18:36
  • Does this answer your question? [Convert "\x" escaped string into readable string in python](https://stackoverflow.com/questions/63218987/convert-x-escaped-string-into-readable-string-in-python) – Karl Knechtel Aug 05 '22 at 03:27

1 Answer

You have to encode/decode 4 times to get the desired result:

print(
  "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

  # any encoding that supports printable ASCII would work here, for example utf-8
  .encode('ascii')

  # unescape the string
  # source: https://stackoverflow.com/a/1885197
  .decode('unicode-escape')

  # latin-1 also works, see https://stackoverflow.com/q/7048745
  .encode('iso-8859-1')

  # finally
  .decode('utf-8')
)
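
To make the four steps easier to follow, here is a minimal sketch that does the same conversion one step at a time. The function name `unescape_utf8` and the variable names are made up for illustration, and the input is assumed to be pure ASCII with `\xNN` escapes.

```python
# Hypothetical helper; it only wraps the four encode/decode steps shown above.
def unescape_utf8(text: str) -> str:
    # 1. str -> bytes; raises UnicodeEncodeError if text has non-ASCII characters
    raw = text.encode("ascii")
    # 2. interpret the \xNN escapes; each becomes one code point in U+0000..U+00FF
    unescaped = raw.decode("unicode-escape")
    # 3. map each of those code points back to the byte with the same value
    utf8_bytes = unescaped.encode("latin-1")
    # 4. decode the resulting bytes as the UTF-8 they really are
    return utf8_bytes.decode("utf-8")


a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(unescape_utf8(a))
```

Steps 2 and 3 are needed because `unicode-escape` produces a `str`, not `bytes`: the Latin-1 round trip turns the code points 0xC5, 0x99, and so on back into the raw bytes that `utf-8` can decode.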


Besides, if you can, consider asking the data source to produce a different output format (a byte array or base64-encoded data, for example).

The unsafe-but-shorter way:

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))


There is also ast.literal_eval, which only evaluates literals and is therefore safer than eval, but it may not be worth using here.
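
For comparison, a sketch of the `ast.literal_eval` variant of the eval approach. It assumes the string contains no single quote, newline, or trailing backslash, since any of those would break the bytes literal being built.

```python
import ast

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# Build the source text of a bytes literal and let literal_eval parse it.
# Unlike eval, literal_eval rejects anything that is not a plain literal.
raw = ast.literal_eval("b'" + st + "'")

print(raw.decode("utf-8"))
```

The result of `literal_eval` here is a `bytes` object, so the final `.decode("utf-8")` is the same last step as in the other approaches.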

user202729