From what I've read, it seems a browser must send x-www-form-urlencoded data in a request using the character set of the page the form was generated from.
So why, then, do some websites, such as http://www.railscasts.com, add ?utf8=%E2%9C%93 (that's ?utf8=✓) to forms? Is this a hack that makes something easier to do? The character set of that page is already UTF-8 (I checked the headers), so can't that alone guarantee that the browser will send UTF-8? Which browsers don't do this? According to w3schools, all major browsers implement accept-charset on forms:
<form accept-charset="UTF-8">
so why isn't that used instead? Or just nothing at all (since the response specifies UTF-8)?
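For reference, a quick check with Python's urllib.parse (my own sketch, not anything Rails itself runs) confirms that %E2%9C%93 is just the percent-encoded UTF-8 form of the check mark ✓ (U+2713):

from urllib.parse import quote, unquote

# %E2%9C%93 percent-decodes, as UTF-8, to U+2713 CHECK MARK (✓)
assert unquote("%E2%9C%93", encoding="utf-8") == "\u2713"

# and the check mark percent-encodes back to the same three bytes
assert quote("\u2713", encoding="utf-8") == "%E2%9C%93"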
I did some investigating:
In a UTF-8 page, it appears as though searching for 木 (U+6728) gives:
search:%E6%9C%A8
So it's using percent-encoding, which appears to be a byte-by-byte hex encoding of whatever the underlying character set is. Well, that definitely works, because those bytes are indeed the UTF-8 encoding of 木. That's good, but it's the simple case, where I'm sending UTF-8 data to a UTF-8 page.
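A quick sanity check with Python (again just my own sketch, not something the browser runs) agrees with that:

from urllib.parse import quote

# 木 (U+6728) encodes to the UTF-8 bytes E6 9C A8...
assert "\u6728".encode("utf-8") == b"\xe6\x9c\xa8"

# ...and percent-encoding those bytes gives exactly what Chrome sent
assert quote("\u6728", encoding="utf-8") == "%E6%9C%A8"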
Now let's say that I have an ISO-8859-1 page with a GET form on it, and I enter the same 木 in a field. Well, that character definitely isn't in ISO-8859-1, so Chrome encodes it as a numeric character reference:
search:&#26408;
which is then percent-encoded appropriately to %26%2326408%3B.
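The same substitution can be reproduced with Python's codec machinery; this is only an illustration of the equivalence, not what the browser literally does internally:

from urllib.parse import quote, unquote

# The submitted value percent-decodes to a numeric character reference,
# not to the character itself
assert unquote("%26%2326408%3B", encoding="iso-8859-1") == "&#26408;"

# Encoding 木 to ISO-8859-1 with xmlcharrefreplace produces that same
# reference, and percent-encoding those ASCII bytes matches Chrome's output
fallback = "\u6728".encode("iso-8859-1", errors="xmlcharrefreplace")
assert fallback == b"&#26408;"
assert quote(fallback) == "%26%2326408%3B"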
I verified that IE 8 on Windows does the same thing. So what's the point of the UTF-8 hack?
Related question: Detecting the character encoding of an HTTP POST request