I recently came across this string: b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'
and I want to decode it.
I know this is possible in Python with `bytes.decode()`,
but that method requires an encoding. How can I determine which encoding to use to decode this byte string?
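For example, here is a minimal sketch (the variable name `data` and the codec names are just illustrative guesses):

```python
data = b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'

# decode() needs a codec name up front
print(data.decode('latin-1'))    # a single-byte codec accepts any bytes, but the output may be gibberish
try:
    print(data.decode('utf-8'))  # a multi-byte codec rejects byte sequences that don't fit its structure
except UnicodeDecodeError as exc:
    print('not valid UTF-8:', exc)
```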

- Decode to what? Printable text? What if it's an actual binary blob and there's no "hidden" text or encoding to determine? – Sergio Tulentsev Mar 19 '21 at 13:00
- *Assuming* that it *is* unicode that has been encoded (and what makes you believe that?), I do not believe there is any way of determining how it was encoded other than trying to decode it with different encodings (e.g. 'utf-8', 'utf-16', etc.) until you find one that doesn't cause a decoding error. – Booboo Mar 19 '21 at 13:05
- Actually, I retrieved a base64 cookie from an old, destroyed WhatsApp session and decoded it. Now I have ended up with this blob of bytes and want to print what it actually means. – Lakshaya U. Mar 19 '21 at 13:05
- You could just loop over all available encodings in https://docs.python.org/3/library/codecs.html#standard-encodings and try each of them to see which ones don't fail and make sense. – 9769953 Mar 19 '21 at 13:07
- So is there any class or list of all encodings in the Python library itself, so that I can try `for enc in encs: try: string.decode(enc)`? – Lakshaya U. Mar 19 '21 at 13:07
- You can copy the list from [this answer](https://stackoverflow.com/a/25584253/9769953). – 9769953 Mar 19 '21 at 13:19
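A minimal sketch of the brute-force loop the last few comments describe, assuming the byte string from the question is bound to `data`; `encodings.aliases.aliases` is used here only as a convenient stand-in for the Standard Encodings table, not an exact copy of that list:

```python
import encodings.aliases

data = b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'

# Try every distinct codec name known to the aliases table.
for enc in sorted(set(encodings.aliases.aliases.values())):
    try:
        text = data.decode(enc)
    except Exception:
        # UnicodeDecodeError for text codecs that reject these bytes, and
        # LookupError/TypeError for bytes-to-bytes codecs (hex, base64, ...)
        continue
    print(f'{enc}: {text!r}')
```

Most single-byte codecs will "succeed" on any input, so the loop mainly tells you which multi-byte codecs reject the data; you still have to judge which output, if any, makes sense.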
1 Answer
My earlier comment on your question was partly accurate and partly in error. From the documentation of Standard Encodings:

> Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences.

So you should try to decode with 'utf-8-sig' (to cover the general case in which a Byte Order Mark, or BOM, might be present as the first 3 bytes; that is not the case for your example, so you could just use 'utf-8'). But if that fails, trial-and-error decoding is not guaranteed to tell you which encoding was used, because, according to the documentation above, an attempt at decoding with another codec could succeed (and possibly give you garbage). If the 'utf-8' decoding succeeds, it is probably the encoding that was used. See below.
```python
s = 'abcde'
# Encode with one codec and decode with a different one: neither line raises,
# but neither gives back the original text.
print(s.encode('utf-32').decode('utf-16'))
print(s.encode('cp500').decode('latin-1'))
```
Prints:

```
a b c d e
�����
```
Of course, decoding with 'utf-8' will also succeed for a string that was encoded with the 'ascii' codec, so there is that level of indeterminacy.
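As a concrete sketch of the recommendation above (again assuming the byte string from the question is bound to `data`): try `'utf-8-sig'` first, and treat any other codec that merely happens not to raise as a guess.

```python
data = b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'

try:
    # 'utf-8-sig' behaves like 'utf-8' but also strips a leading BOM if one is present
    text = data.decode('utf-8-sig')
except UnicodeDecodeError:
    text = None

if text is not None:
    print('Valid UTF-8, probably the right codec:', text)
else:
    print('Not valid UTF-8; any other codec that "works" may still be wrong.')
```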
