About JAVA encoding recognition

Question

I have a string like "%E6%B1%82%E5%8A%A9".

My question is how I can tell whether it is encoded as "UTF-8" or not. It also looks like it could be GBK (or GB2312) encoding.

Thank you.

"abc" can be encoded as UTF-8 or UTF-16, and if it isn't re-encoded (ASCII encoding) it stays the same. So the same string may have multiple valid encodings. – Dec 27 '12 at 03:13

score 5 · Accepted Answer · answered Dec 27 '12 at 03:13

This is not UTF-8 encoding as such; it is called percent-encoding or URL encoding.

You can decode it in Java using the URLDecoder API.
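
As a minimal sketch of that suggestion, the example below decodes the string from the question with `java.net.URLDecoder`, assuming the underlying bytes are UTF-8. The class name `UrlDecodeExample` and the helper `decode` are made up for illustration. Note that `URLDecoder` also turns `+` into a space, since it targets form-encoded data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodeExample {
    // Decodes a percent-encoded string, interpreting the resulting bytes
    // in the named charset. An unknown charset name is reported as an
    // unchecked exception to keep the call site simple.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The string from the question; the charset is an assumption here.
        String encoded = "%E6%B1%82%E5%8A%A9";
        System.out.println(decode(encoded, "UTF-8"));
    }
}
```

The charset argument is the caller's claim about the bytes; `URLDecoder` does not verify or detect it.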

mvp

Thanks for the reply. URLDecoder.decode converts each %XX escape into its numeric byte value. An encoding has to be given as the 2nd argument of the decode function. For the above string, URLDecoder.decode(str, "utf-8") works well. But for another string like "%C4%E3%BA%C3", it returns garbled text, and the encoding should be set to "gb2312". – thomaslee Dec 27 '12 at 03:54
What you can do then is manually transform your percent-encoded string into a byte array, then use `juniversalchardet` to guess the actual encoding and convert it to `UTF-8` (see more here http://stackoverflow.com/a/1678810/1734130 ). But this is really messy and **extremely** unreliable with a string only 4 bytes long – mvp Dec 27 '12 at 04:04
I have implemented a function to transform a percent-encoded string into a byte array. What puzzles me is which encoding should be given for it. I will try `juniversalchardet`. Thank you! – thomaslee Dec 27 '12 at 04:19
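
A rough sketch of the byte-array approach discussed in these comments, using only the JDK instead of `juniversalchardet`: convert the `%XX` escapes to raw bytes, try a strict UTF-8 decode first, and fall back to GBK if the bytes are not valid UTF-8. The class and method names are hypothetical, the candidate order (UTF-8, then GBK) is an assumption, and the `GBK` charset is assumed to be available in the running JDK. This is a heuristic, not a reliable detector.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class PercentBytesGuess {
    // Turns "%E6%B1..." into raw bytes. Characters outside a complete %XX
    // escape are copied as single bytes (assumed to be ASCII). Malformed
    // hex digits cause a NumberFormatException.
    static byte[] percentToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write((byte) c);
            }
        }
        return out.toByteArray();
    }

    // Returns the decoded text if the bytes are valid in the charset,
    // or null if the decoder reports malformed or unmappable input.
    static String strictDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Tries UTF-8 first, then GBK; returns null if neither decodes cleanly.
    static String guessDecode(String encoded) {
        byte[] bytes = percentToBytes(encoded);
        String utf8 = strictDecode(bytes, StandardCharsets.UTF_8);
        if (utf8 != null) {
            return utf8;
        }
        return strictDecode(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println(guessDecode("%E6%B1%82%E5%8A%A9")); // valid UTF-8
        System.out.println(guessDecode("%C4%E3%BA%C3"));       // not valid UTF-8, falls back to GBK
    }
}
```

UTF-8 is tried first because its byte patterns are restrictive: most GBK text fails a strict UTF-8 decode, whereas the UTF-8 bytes of the first string also happen to form valid GBK pairs. With inputs this short, a wrong guess remains quite possible.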

score 1 · Answer 2 · answered Dec 27 '12 at 03:17

There is no way to detect the encoding of a stream of bytes with 100% accuracy. Still, there are libraries capable of making quite effective educated guesses. Among them I would recommend juniversalchardet.

Anthony Accioly

Unfortunately, in this case `juniversalchardet` will detect this text as ASCII or UTF-8, which does not really help in getting the encoded text out – mvp Dec 27 '12 at 03:23
