From RFC-3986, section 2.5:

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
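To make the quoted rule concrete, here is a minimal sketch in Python of "encode as UTF-8 first, then percent-encode every octet outside the unreserved set". The function name `percent_encode_utf8` is hypothetical and only for illustration; it is not taken from the RFC or from any library.

```python
# Unreserved characters per RFC 3986, section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: UTF-8 encode, then percent-encode non-unreserved octets."""
    parts = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            parts.append(chr(octet))        # unreserved octets pass through as-is
        else:
            parts.append(f"%{octet:02X}")   # everything else becomes %HH
    return "".join(parts)


print(percent_encode_utf8("A"))        # A
print(percent_encode_utf8("\u00C0"))   # %C3%80     (LATIN CAPITAL LETTER A WITH GRAVE)
print(percent_encode_utf8("\u30A2"))   # %E3%82%A2  (KATAKANA LETTER A)
```

The three outputs match the three examples given in the RFC text above.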

So in the question "What is the proper way to URL encode Unicode characters?", people assert that non-ASCII characters in an IRI should be converted to UTF-8 first and then percent-encoded.

But I found a sample educational web form with the application/x-www-form-urlencoded Content-Type, filled it with some non-ASCII characters in four browsers (Firefox, Chrome, Opera, IE), and looked at the resulting POST requests in Wireshark. It turned out that the percent-encoded sequences (%H1H2%H3H4...) use the encoding the form page has at the moment the form is submitted.

So for the letter 'Ж', if the form page encoding is set to UTF-8, I get %D0%96. If I switch to the 8-bit Windows-1251, I get %C6. And if I switch to CP-1252, I get %26%231046, where %26 is & and %23 is #. In other words, I get the XML numeric character reference of 'Ж', &#1046, because CP-1252 has no such letter.
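The three results can be approximated in Python with `urllib.parse.quote`, which accepts an `encoding` and an `errors` argument. This is only a sketch of the observed behaviour, not how any browser is implemented: the helper name `form_encode_value` is hypothetical, and using `errors="xmlcharrefreplace"` to imitate the fallback for unrepresentable characters is my assumption. Note that Python emits the complete reference `&#1046;`, so its output ends in `%3B`, the encoded semicolon. It also does not turn spaces into `+` as real form encoding does.

```python
from urllib.parse import quote


def form_encode_value(value: str, page_charset: str) -> str:
    """Hypothetical helper: percent-encode a form value using the page's charset.

    Characters the charset cannot represent are replaced by a numeric
    character reference before percent-encoding (an assumption that imitates
    the behaviour described above).
    """
    return quote(value, safe="", encoding=page_charset, errors="xmlcharrefreplace")


print(form_encode_value("Ж", "utf-8"))    # %D0%96
print(form_encode_value("Ж", "cp1251"))   # %C6
print(form_encode_value("Ж", "cp1252"))   # %26%231046%3B  (i.e. "&#1046;")
```

The same character therefore produces three different byte sequences on the wire, and the server cannot tell them apart without knowing which charset was used for the submission.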

So my question is: why don't browsers convert IRIs to UTF-8 first, even though the URI RFC seems to require it?

Maybe this is because http:// is an old URI scheme? From https://en.wikipedia.org/wiki/Percent-encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

So it says that URI schemes introduced before this date are not affected. But that seems like a lame explanation.

Also, at https://unspecified.wordpress.com/2008/07/08/browser-uri-encoding-the-best-we-can-do/ someone ran into the same problem and tries to explain it by saying that the HTML specification is vague. But I still can't understand how the HTML standard comes into this. The request is made by the browser anyway, and the browser should generate proper URIs.

Thank you for your attention.

JenyaKh
  • The [URI](https://tools.ietf.org/html/rfc3986), [IRI](https://tools.ietf.org/html/rfc3987), and [`application/x-www-form-urlencoded`](http://www.w3.org/TR/html5/forms.html#url-encoded-form-data) specs differ in how Unicode is encoded. An HTML5 webform submitted in `application/x-www-form-urlencoded` format uses UTF-8 if another charset is not requested and the HTML's charset is not ASCII-compatible. URI/IRI don't apply when encoding a webform (the `urlencoded` portion of the typename is a little misleading). The encoded webform data is URI/IRI-compatible, so no need to re-encode to their rules – Remy Lebeau Oct 21 '16 at 01:05
  • @RemyLebeau, thank you for the comment. Could you explain some points? 1. "another charset is not requested": do you mean the accept-charset HTML attribute? 2. "URI/IRI don't apply when encoding a webform (the urlencoded portion of the typename is a little misleading). The encoded webform data is URI/IRI-compatible, so no need to re-encode to their rules": sorry, but I failed to understand this part of your comment. A string like param1=value1&param2=value2 is formed as a URI by the browser. So it should be built according to the URI specification, shouldn't it? – JenyaKh Oct 21 '16 at 03:28
  • Read the link in my previous comment, it explains the exact rules. `accept-charset` is one place an encoding charset is looked for, but it is not the only place. And a webform is encoded regardless of how it will be transmitted afterwards. The encoded format uses only ASCII characters and is compatible with the rules of a URI/IRI query string. – Remy Lebeau Oct 21 '16 at 03:35
  • @RemyLebeau, okay, I've got it. Thank you! But I checked this for HTML4 and discovered that it has no such rules about encoding: https://www.w3.org/TR/1998/REC-html40-19980424/interact/forms.html#h-17.13.4.1 (there is little information about application/x-www-form-urlencoded there compared to HTML5). So if a page was created as an HTML4 form, I can't rely on anything, and the encoding may be chosen some other way, because the HTML4 spec says nothing about it? – JenyaKh Oct 21 '16 at 03:48
  • HTML4 will probably just use the HTML's charset as the webform submission charset. – Remy Lebeau Oct 25 '16 at 03:02

0 Answers