How to determine the type of encoding of a particular string?

Question

I looked through various SO answers but found nothing constructive enough. What I need to know is which encoding this string uses: \x86\x9cG<!\xd9F@\xb4\n\xd6\xd4(\x9cb\xfe

Should I use an online tool? Should I write some code in Python? Any help would be appreciated.

I only want to know which encoding the above string is in. That's all.

Are you using Python 2 or 3? String handling is very different in the two versions. — Two-Bit Alchemist, Mar 17 '15 at 15:29
No, it's not a "possible duplicate" by any means. Compare the two questions before trying to blatantly close questions. — HackCode, Mar 17 '15 at 15:47
Maybe it's not even an "encoding"... Perhaps it's a byte array that was read from a compressed/encrypted file or something... It's short enough that, statistically speaking, it may be difficult to reliably determine the format (at least unambiguously - you might get several matches) - 16 bytes by my count, but I could be off by a couple... — twalberg, Mar 17 '15 at 16:44
@kaushaya Are you kidding? The question may not be the same because the encoding possibilities are narrowed down in advance, but the top answer there shows you how to guess the possible encodings of a byte string in Python 2, and in fact recommends the same library as the only answer you've gotten here. If your question gets closed, it's just on hold until you edit it -- in this case, explain why the answer doesn't help you -- and it's only got one close vote (mine) anyway. Meanwhile you haven't gotten any better answers... — Two-Bit Alchemist, Mar 17 '15 at 17:24
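
The byte count mentioned in the comments can be checked directly. What follows is a minimal Python 3 sketch, not part of the original thread, that prints the length, the hex form, and how many bytes fall in the printable ASCII range. A length of 16 bytes matches the size of an MD5 digest, a UUID, or one AES block, which is only suggestive and proves nothing by itself.

```python
# Python 3 sketch: inspect the raw bytes before guessing at an encoding.
data = b'\x86\x9cG<!\xd9F@\xb4\n\xd6\xd4(\x9cb\xfe'

length = len(data)
hex_form = data.hex()
# Count bytes in the printable ASCII range (space through tilde).
printable = sum(1 for b in data if 0x20 <= b <= 0x7e)

print('length:', length)
print('hex:', hex_form)
print('printable ASCII bytes: {0} of {1}'.format(printable, length))
```

A low share of printable bytes in such a short sample is one hint that the data may be binary rather than encoded text.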

score 3 · Answer 1 · answered Mar 17 '15 at 15:38

In my experience, the enca command-line tool is pretty good at guessing the encoding correctly:

http://linux.die.net/man/1/enca

In Python, there's chardet:

https://github.com/chardet/chardet
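
A minimal sketch of how chardet might be applied to the bytes from the question. It assumes the third-party chardet package is installed; `guess_encoding` is a hypothetical helper name used only for illustration, and the import is guarded so the snippet still runs without the package.

```python
# Sketch only: chardet is a third-party package (pip install chardet).
try:
    import chardet
except ImportError:
    chardet = None  # fall through gracefully when it is not installed

data = b'\x86\x9cG<!\xd9F@\xb4\n\xd6\xd4(\x9cb\xfe'


def guess_encoding(raw):
    """Return chardet's guess as a dict, or None if chardet is unavailable."""
    if chardet is None:
        return None
    # detect() returns a dict that includes 'encoding' and 'confidence' keys
    return chardet.detect(raw)


print(guess_encoding(data))
```

With only 16 bytes of input, any guess is likely to come with low confidence, so treat the result as a hint rather than an answer.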

LetMeSOThat4U

Aaron · Answer 2 · 2015-03-18T00:40:38.937

One approach I can think of: try decoding the string with every available encoding.

The following encoding names are taken from Python's Standard Encodings list:

code_list = ["ascii", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500",
 "cp720", "cp737", "cp775", "cp850", "cp852", "cp855", "cp856", "cp857", "cp858",
 "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874",
 "cp875", "cp932", "cp949", "cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251",
 "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258", "euc_jp",
 "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp",
 "iso2022_jp_1", "iso2022_jp_2", "iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext",
 "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5", "iso8859_6",
 "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14",
 "iso8859_15", "iso8859_16", "johab", "koi8_r", "koi8_u", "mac_cyrillic", "mac_greek",
 "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis",
 "shift_jis_2004", "shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16",
 "utf_16_be", "utf_16_le", "utf_7", "utf_8", "utf_8_sig", "idna", "mbcs", "palmos",
 "punycode", "raw_unicode_escape", "rot_13", "undefined", "unicode_escape",
 "unicode_internal", "base64_codec", "bz2_codec", "hex_codec", "quopri_codec",
 "string_escape", "uu_codec", "zlib_codec"]

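# A bytes literal: the b prefix is accepted by Python 2.6+ and required on Python 3.
s = b'\x86\x9cG<!\xd9F@\xb4\n\xd6\xd4(\x9cb\xfe'

for i in code_list:
    try:
        # repr() (the !r conversion) keeps control characters visible in the output
        print('Using {0} to decode......{1!r:<30}'.format(i, s.decode(i)))
    except Exception as e:
        # the codec is unavailable here, or the bytes are invalid in this encoding
        print('{0}: {1}'.format(i, e))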

I tried this, since it seemed sensible. Sadly, using each of these, you mainly get nonsense strings or exceptions, so I think @kaushaya might have to dig some more. — Kyle_S-C, Mar 18 '15 at 00:46
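
To illustrate why brute-force decoding yields nonsense, here is a minimal Python 3 sketch that keeps only the codecs whose output is fully printable. The helper name `plausible_decodings` and the short `candidates` list are assumptions made for illustration. A clean decode only shows that the bytes are valid in that codec, not that the codec is the right one.

```python
# Python 3 sketch: keep only the codecs whose output is fully printable.
data = b'\x86\x9cG<!\xd9F@\xb4\n\xd6\xd4(\x9cb\xfe'

# A short illustrative subset; substitute a longer list of codec names as needed.
candidates = ['ascii', 'utf_8', 'utf_16', 'cp1252', 'latin_1', 'cp437', 'koi8_r']


def plausible_decodings(raw, encodings):
    """Return (encoding, text) pairs that decode cleanly to printable text."""
    results = []
    for name in encodings:
        try:
            text = raw.decode(name)
        except (LookupError, UnicodeError):
            # unknown codec, or bytes that are invalid in this encoding
            continue
        # Allow common whitespace, reject other control characters.
        if all(ch.isprintable() or ch in '\n\r\t' for ch in text):
            results.append((name, text))
    return results


matches = plausible_decodings(data, candidates)
for name, text in matches:
    print(name, repr(text))
```

Several single-byte code pages accept almost any byte sequence, so more than one candidate can survive this filter. Choosing between survivors still requires knowing where the bytes came from.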
