-1

I have a string like "%E6%B1%82%E5%8A%A9".

My question is how I can tell whether it is encoded with "UTF-8" or not. It also looks like it could be GBK (or GB2312) encoding.

Thank you.

thomaslee
  • "abc" can be encoded as UTF-8 or UTF-16, and as plain ASCII it is unchanged, so the same string may have multiple valid encodings. –  Dec 27 '12 at 03:13

2 Answers

5

This is not UTF-8 encoding; it is called percent-encoding, or URL encoding.

You can decode it in Java using the URLDecoder API.
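As a minimal sketch of what that looks like, the snippet below runs both strings from this thread through `java.net.URLDecoder.decode(String, String)`. The class name `UrlDecodeDemo` and the `decode` wrapper are made up for illustration, and it assumes your JDK ships the GB2312 charset (most full JDKs do, but it is an extended charset and not guaranteed everywhere).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; only URLDecoder.decode comes from the JDK.
class UrlDecodeDemo {

    // Percent-decodes s and interprets the resulting bytes in the given charset.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Rethrown unchecked so callers do not need a try/catch.
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The string from the question, whose bytes are UTF-8.
        System.out.println(decode("%E6%B1%82%E5%8A%A9", "UTF-8"));
        // A string whose bytes are GB2312 (assumes the charset is installed).
        System.out.println(decode("%C4%E3%BA%C3", "GB2312"));
    }
}
```

Note that the caller still has to supply the charset name; `URLDecoder` does not guess it.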

mvp
  • Thanks for the reply. URLDecoder.decode converts each %XX to its numeric byte value. The encoding must be given as the second argument of decode. For the string above, URLDecoder.decode(str, "utf-8") works well. But for another string like "%C4%E3%BA%C3", it returns garbled text, and the encoding has to be set to "gb2312". – thomaslee Dec 27 '12 at 03:54
  • What you can do then is manually transform your percent-encoded string into a byte array, then use `juniversalchardet` to guess the actual encoding and convert it to `UTF-8` (see more here: http://stackoverflow.com/a/1678810/1734130). But this is really messy and **extremely** unreliable with a string only 4 bytes long. – mvp Dec 27 '12 at 04:04
  • I have implemented a function to transform a percent-encoded string into a byte array. What puzzled me is which encoding should be given for it. I will try `juniversalchardet`. Thank you! – thomaslee Dec 27 '12 at 04:19
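For anyone wanting to try the byte-array route from these comments without a detection library, here is a rough sketch using only the JDK. It is not `juniversalchardet`; it swaps in a simpler heuristic: decode the bytes strictly as UTF-8 and, only if that fails, fall back to GBK. The class and method names (`PercentBytes`, `toBytes`, `strictDecode`, `guess`) are invented for this example, and the fallback to GBK is an assumption based on the charsets mentioned in the question. UTF-8 is tried first because byte sequences that are valid UTF-8 are often also valid GBK, but not the other way around.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; all names here are illustrative.
class PercentBytes {

    // Turns "%E6%B1..." into raw bytes. Characters outside a %XX escape
    // are copied as single bytes ('+' is NOT treated as a space here).
    static byte[] toBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Decodes strictly: returns null instead of substituting replacement
    // characters when the bytes are not valid in the given charset.
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Heuristic guess: UTF-8 first, then GBK (assumed to be installed).
    // May return null if neither charset accepts the bytes.
    static String guess(String percentEncoded) {
        byte[] bytes = toBytes(percentEncoded);
        String utf8 = strictDecode(bytes, StandardCharsets.UTF_8);
        if (utf8 != null) {
            return utf8;
        }
        return strictDecode(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println(guess("%E6%B1%82%E5%8A%A9")); // bytes are valid UTF-8
        System.out.println(guess("%C4%E3%BA%C3"));       // not valid UTF-8, falls back to GBK
    }
}
```

This only works as a guess: short GBK text can happen to be valid UTF-8 as well, in which case the UTF-8 reading wins even if it is wrong.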
1

There is no way to detect the encoding of a stream of bytes with 100% accuracy. Still, there are libraries capable of making quite effective educated guesses; among them I would recommend juniversalchardet.

Anthony Accioly
  • Unfortunately, in this case `juniversalchardet` will detect this text as ASCII or UTF-8, which does not really help in recovering the encoded text. – mvp Dec 27 '12 at 03:23