From RFC-3986, section 2.5:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
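To make the quoted rule concrete, here is a minimal sketch in Python. The helper name rfc3986_encode is my own, and urllib.parse.quote is used only as a convenient stand-in for "encode as UTF-8, then percent-encode everything outside the unreserved set"; it is not what any browser runs.

```python
from urllib.parse import quote


def rfc3986_encode(text: str) -> str:
    """Hypothetical helper: UTF-8 first, then percent-encode.

    safe="" means only the unreserved characters (letters, digits,
    "-", ".", "_", "~") are left as they are.
    """
    return quote(text, safe="", encoding="utf-8")


# The three characters used as examples in RFC 3986, section 2.5.
for ch in ["A", "\u00C0", "\u30A2"]:
    print(repr(ch), "->", rfc3986_encode(ch))
# 'A' -> A
# 'À' -> %C3%80
# 'ア' -> %E3%82%A2
```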
So in the question "What is the proper way to URL encode Unicode characters?" people assert that non-ASCII symbols in an IRI should be converted to UTF-8 first, before they are percent-encoded.
But I found a sample educational web form with the application/x-www-form-urlencoded Content-Type, filled it with some non-ASCII symbols in four browsers (Firefox, Chrome, Opera, IE), and looked at the resulting POST requests in Wireshark. It turned out that the %H1H2%H3H4...%HkHk+1 sequences are produced using the character encoding of the form page at the moment the form is submitted.
So for the letter 'Ж', if the form page encoding is set to UTF-8, I get %D0%96, but if I switch to the 8-bit Windows-1251, I get %C6, and if I switch to CP-1252, I get %26%231046, where %26 is & and %23 is #. In that last case I get the XML numeric character reference for 'Ж' (&#1046;), because there is no such letter in CP-1252.
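For anyone who wants to check these values without a packet capture, here is a rough sketch that imitates the behaviour I observed. The function name form_encode is hypothetical, and using errors="xmlcharrefreplace" to mimic the browsers' fallback for unrepresentable characters is my assumption about what they do, not something taken from a specification.

```python
from urllib.parse import quote


def form_encode(text: str, page_encoding: str) -> str:
    """Hypothetical helper: percent-encode text in the page's encoding.

    Characters the encoding cannot represent are first replaced with a
    numeric character reference (&#NNNN;), which is then percent-encoded
    like any other ASCII text.
    """
    return quote(text, safe="", encoding=page_encoding,
                 errors="xmlcharrefreplace")


for enc in ["utf-8", "cp1251", "cp1252"]:
    print(enc, "->", form_encode("\u0416", enc))  # U+0416 is 'Ж'
# utf-8 -> %D0%96
# cp1251 -> %C6
# cp1252 -> %26%231046%3B   (Python also encodes the trailing ';' as %3B)
```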
So my question is: why do browsers not convert IRIs to UTF-8 first, even though the URI RFC seems to require it?
Maybe, this is because http:// is an old URI-scheme? From https://en.wikipedia.org/wiki/Percent-encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
So it says that URI schemes introduced before this date are not affected. But that seems like a lame explanation.
Also, at https://unspecified.wordpress.com/2008/07/08/browser-uri-encoding-the-best-we-can-do/ someone ran into the same problem as I did and tries to explain it by saying that it all comes down to a vague HTML specification. But I still can't understand how the HTML standard comes into this. The request is made by the browser anyway, and the browser should generate proper URIs.
Thank you for your attention.