Need help identifying type of UTF Encoding

Question

I'm having a hard time figuring out which Unicode encoding I need to convert my data to before passing it in a POST request. The data is mostly Chinese characters.

Example String:

的事故事务院治党派驻地是不是

Expected Unicode: %u7684%u4E8B%u6545%u4E8B%u52A1%u9662%u6CBB%u515A%u6D3E%u9A7B%u5730%u662F%u4E0D%u662F

Tried to encode to UTF-16BE: %76%84%4E%8B%65%45%4E%8B%52%A1%96%62%6C%BB%51%5A%6D%3E%9A%7B%57%30%66%2F%4E%0D%66%2F

Encoded text in UTF-16: %FF%FE%84%76%8B%4E%45%65%8B%4E%A1%52%62%96%BB%6C%5A%51%3E%6D%7B%9A%30%57%2F%66%0D%4E%2F%66

Encoded text in UTF-8: %E7%9A%84%E4%BA%8B%E6%95%85%E4%BA%8B%E5%8A%A1%E9%99%A2%E6%B2%BB%E5%85%9A%E6%B4%BE%E9%A9%BB%E5%9C%B0%E6%98%AF%E4%B8%8D%E6%98%AF

As you can see, UTF-16BE is the closest: it produces the same two bytes per character, but the expected format puts a single %u in front of each four-digit code unit instead of a % in front of each byte.

I've been using URLEncoder.encode with the standard charset names to get the encoded text, but none of them returns the expected format.

Code:

String text = "的事故事务院治党派驻地是不是";
String encoded = URLEncoder.encode(text, "UTF-16BE"); // java.net.URLEncoder; throws UnsupportedEncodingException

Possible duplicate of [How to check the charset of string in Java?](https://stackoverflow.com/questions/11497902/how-to-check-the-charset-of-string-in-java) — Dziugas, Jul 04 '17 at 17:16
@Kayaman this is the unicode value that i grabbed while sniffing the post request in chrome console. — fadhli-sulaimi, Jul 04 '17 at 17:25
If the server decodes the data properly, then it looks like you want to use UTF-16BE encoding. It doesn't matter what you sniffed. — Kayaman, Jul 04 '17 at 17:37

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

As Kayaman said in a comment: Your expectation is wrong.

That is because %uNNNN is not a valid URL encoding of Unicode text. As Wikipedia puts it:

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C.

So unless your server is expecting non-standard input, your expectation is wrong.
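For illustration only, here is a minimal sketch of what that non-standard %uXXXX form amounts to: each UTF-16 code unit written as four hex digits. The class and method names are made up for this example, and it assumes every character is escaped, including ASCII, which may differ from what a browser-side escape function emits.

```java
class PercentUSketch {
    // Non-standard %uXXXX form: one escape per UTF-16 code unit.
    // Assumption: every char is escaped, ASCII included.
    static String toPercentU(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            // charAt returns a UTF-16 code unit; print it as 4 uppercase hex digits
            sb.append(String.format("%%u%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPercentU("的事故事务院治党派驻地是不是"));
    }
}
```

Applied to the example string from the question, this should reproduce the "expected" value, which shows that the sniffed format is simply the code units in hex and not one of the standard charset encodings.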

Instead, use UTF-8. As Wikipedia puts it:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

That, however, applies to sending data in a URL, e.g. as part of a GET request.
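As a rough sketch of the UTF-8 approach in Java, URLEncoder.encode can be called with "UTF-8" as the charset name. The wrapper class and method names here are hypothetical. Note that URLEncoder produces form-style encoding (a space becomes +), so it is not a strict RFC 3986 encoder, although for Chinese text the result is the same.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class Utf8PercentSketch {
    // Percent-encode the UTF-8 bytes of the text.
    static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeUtf8("的事故事务院治党派驻地是不是"));
    }
}
```

For the example string, this should give the same value as the "Encoded text in UTF-8" line in the question.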

For sending text data as part of an application/x-www-form-urlencoded POST, see the HTML5 documentation:

If the form element has an accept-charset attribute, let the selected character encoding be the result of picking an encoding for the form.

Otherwise, if the form element has no accept-charset attribute, but the document's character encoding is an ASCII-compatible character encoding, then that is the selected character encoding.

Otherwise, let the selected character encoding be UTF-8.

Since most web pages ("the document") are presented in UTF-8 these days, that would likely mean UTF-8.
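A minimal sketch of such a POST from Java, assuming the server accepts UTF-8 form data. The field name "q", the class name, and any endpoint passed to post are placeholders and do not come from the question.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class FormPostSketch {
    // Build a one-field form body: name=value, both percent-encoded as UTF-8.
    static byte[] buildBody(String name, String value) {
        try {
            String body = URLEncoder.encode(name, "UTF-8") + "="
                    + URLEncoder.encode(value, "UTF-8");
            // After percent-encoding, the body contains only ASCII characters
            return body.getBytes(StandardCharsets.US_ASCII);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Send the body to a hypothetical endpoint and return the HTTP status code.
    static int post(String endpoint, byte[] body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setFixedLengthStreamingMode(body.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Only builds and prints the body; no request is sent here
        byte[] body = buildBody("q", "的事故事务院治党派驻地是不是");
        System.out.println(new String(body, StandardCharsets.US_ASCII));
    }
}
```

Keeping body construction separate from sending makes it easy to inspect exactly which bytes would go over the wire before pointing the code at a real server.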

I see, thanks. So i have to set the request property to form-urlencoded for post request. For this case, if i set the charset to accept UTF-8, i can simply send UTF-8 data. Is this what you mean? — fadhli-sulaimi, Jul 05 '17 at 04:37

score 0 · Answer 2 · answered Jul 05 '17 at 09:39

I think you are overthinking this. The encoding of a text doesn't need to "resemble" the sequence of Unicode code points of that text in any way. These are two different things.

To send the string 的事故事务院治党派驻地是不是 in a POST request, just write the entire POST request and encode it with UTF-8; the resulting bytes are what is sent to the server as the body of the POST request.

As pointed out by @Andreas, UTF-8 is the default encoding of HTML5, so it's not even necessary to set the accept-charset attribute, because the server will automatically use UTF-8 to decode the body of your request, if accept-charset is not set.
