1

I recently came across this byte string: b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X' and I want to decode it. I know this is possible in Python with bytes.decode(), but it requires an encoding. How can I determine the encoding to decode this string?

Lakshaya U.
  • 1,089
  • 1
  • 5
  • 21
  • Decode to what? Printable text? What if it's an actual binary blob and there's no "hidden" text or encoding to determine? – Sergio Tulentsev Mar 19 '21 at 13:00
  • *Assuming* that it *is* unicode that has been encoded, (and what makes you believe that?), I do not believe there is any way of determining how it was encoded other than trying to decode it with different encodings (e.g. 'utf-8', 'utf-16', etc.) until you find one that doesn't cause a decoding error. – Booboo Mar 19 '21 at 13:05
  • Actually, I retrieved a base64 cookie from an old, destroyed WhatsApp session and decoded it. Now I've ended up with this blob of bytes and want to print what it actually means. – Lakshaya U. Mar 19 '21 at 13:05
  • You could just loop over all available encodings in https://docs.python.org/3/library/codecs.html#standard-encodings and try each of them to see which ones don't fail and make sense. – 9769953 Mar 19 '21 at 13:07
  • So is there any class or list of all encodings in the Python library itself, so that I can try ```for enc in encs: try: string.decode(enc)```? – Lakshaya U. Mar 19 '21 at 13:07
  • 2
    You can copy the list from [this answer](https://stackoverflow.com/a/25584253/9769953); a minimal sketch of the loop follows below. – 9769953 Mar 19 '21 at 13:19
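
A minimal sketch of that loop, using the bytes from the question. The candidate list here is only a small illustrative subset; in practice you would use the full list of standard encodings from the codecs documentation or the linked answer:

data = b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'

# Only a handful of candidates here; extend this with the full list of
# standard encodings if you want to be exhaustive.
candidates = ['utf-8', 'utf-16', 'utf-32', 'latin-1', 'cp1252', 'cp500']

for enc in candidates:
    try:
        print(enc, '->', data.decode(enc))
    except UnicodeDecodeError:
        pass  # this codec cannot decode these bytes; try the next one

Note that several of the charmap codecs will "succeed" on almost any input, so a successful decode does not prove that the right encoding was found.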

1 Answer

2

My earlier comment on your question was partly accurate and partly in error. From the documentation of Standard Encodings:

Without external information it’s impossible to reliably determine which encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that’s not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn’t allow arbitrary byte sequences.

So you should first try to decode with 'utf-8-sig' (this covers the general case in which a Byte Order Mark, or BOM, might be present as the first 3 bytes; that is not the case for your example, so you could just use 'utf-8'). If that fails, you cannot reliably determine the encoding by trial-and-error decoding because, per the documentation above, an attempt to decode with some other codec could still succeed (and possibly give you garbage). If the 'utf-8' decoding succeeds, it is probably the encoding that was used. See the demonstration below.
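
A minimal sketch of that first check, using the bytes from your question (the variable name and printed messages are my own):

data = b'\xd5\xa3Lk\xd4\xad\xaeH\xb8\xae\xab\xd8EL3\xd1RR\x17\x0c\xea~\xfa\xd0\xc9\xfeJ\x9aq\xd0\xc57\xfd\xfa\x1d}\x8f\x99?*\xef\x88\x1e\x99\x8d\x81t`1\x91\xebh\xc5\x9d\xa7\xa5\x8e\xb9X'

try:
    text = data.decode('utf-8-sig')  # also accepts plain UTF-8 without a BOM
    print('Valid UTF-8; probably the encoding that was used:', text)
except UnicodeDecodeError:
    print('Not valid UTF-8; trial-and-error with other codecs is unreliable.')

And here is a demonstration of how decoding with the wrong codec can succeed without raising an error: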

s = 'abcde'
print(s.encode('utf-32').decode('utf-16'))   # wrong codec, but no error raised
print(s.encode('cp500').decode('latin-1'))   # EBCDIC bytes decoded as Latin-1

Prints:

 a b c d e
�����

Of course, decoding with 'utf-8' will also succeed for bytes that were encoded with the 'ascii' codec, so there is that level of indeterminacy.
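
For instance (a trivial illustration):

# Pure ASCII bytes are, by design, also valid UTF-8.
print(b'hello'.decode('ascii'))   # hello
print(b'hello'.decode('utf-8'))   # hello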

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
Booboo
  • 38,656
  • 3
  • 37
  • 60