I want to standardise on UTF-8 on our web site. All our databases and internet stuff is in UTF-8. All our web servers are sending the charset=utf-8 HTTP header. However, I've discovered that by changing the encoding in my Firefox (View -> Character Encoding) to something else, I can enter a Latin-9 character into a form and PHP just treats it as malformed UTF-8.

How much do I have to worry about that? Is it possible for the user's web browser to override the UTF-8 charset header and send non-UTF-8 data?

Update: Several people have suggested accept-charset on the individual forms. However, I'd rather not have to change every web form. Assuming I can control the HTTP Content-Type header, and it's set to UTF-8, do I have anything to worry about?

Paul D. Waite
Amandasaurus
  • "All our databases and internet stuff" - all your internet stuff are belong to us. – Paul D. Waite Feb 13 '13 at 15:47
  • As per the accepted answer to [this question](http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charset-utf-8-to-html-forms-if-the-page), `accept-charset` will solve the specific problem you've discovered: i.e. if the user tells their browser to interpret the page as non-UTF-8, `accept-charset` should make the browser submit form content as UTF-8 despite that. Whether guarding against that particular situation is worth adding the attribute to all your forms, well, that's your judgment call. – Paul D. Waite Feb 13 '13 at 15:52
  • Make sure your page is really UTF-8: in the browser debugger, look for the Content-Type header that's sent. Also, in the JS console, evaluate document.charset; it should return some spelling of utf8. 'windows-1252' may mean the browser doesn't recognize the encoding sent. – OsamaBinLogin Jan 20 '16 at 01:39

4 Answers

Is it possible for the user's web browser to override the utf8 charset header and send non-UTF8?

Of course. You don't control the client, and the client can do whatever it wants, including letting users override the normal encodings and cause junk (or what passes for junk) to be sent to your server.

That said, it sounds like you've taken most of the important steps here. Your actual HTML document is UTF-8 encoded and explicitly marked as such, which means that browsers will generally default to submitting forms in that encoding as well. (Note that the HTML spec doesn't require this. Specifying accept-charset on the form explicitly is the only spec-compliant guarantee.) I suspect that this will work as expected in all modern browsers, and you could test this easily.

On the server, your job is always to validate your input to the extent that it's important to your service. Although the vast majority of your users will be benevolent and using modern standard browsers, HTTP is an open protocol, and both wacky users and malicious hackers are out there and can throw any kind of data they want at you. Make sure that you're not making assumptions about data encodings when security or authenticated data is involved, and sanitize this stuff before you shove it into databases.
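
As a rough illustration of the validation step described above, here is a minimal sketch in Python (the thread itself is about PHP, so treat this as language-neutral pseudocode that happens to run). The helper name `is_valid_utf8` is made up for this example; it simply attempts a strict decode of the raw submitted bytes and reports whether that succeeded.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` is well-formed UTF-8 (hypothetical helper)."""
    try:
        # A strict decode raises on any malformed byte sequence.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The euro sign is three bytes in UTF-8 but a single byte (0xA4) in Latin-9.
utf8_euro = "€".encode("utf-8")
latin9_euro = "€".encode("iso-8859-15")

print(is_valid_utf8(utf8_euro))    # True
print(is_valid_utf8(latin9_euro))  # False
```

Input that fails the check can then be rejected or handled separately before it gets anywhere near the database.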

Ben Zotto

I think the best solution is to convert to UTF-8 and handle any non-UTF-8 characters when the user submits data. As noted above, accept-charset="UTF-8" does not guarantee that the submitted data is UTF-8. And if you have to change the forms all over your site, it is not a good solution.

So, processing the input upon submission might be a better way.
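
One way this could look, sketched in Python rather than PHP: try UTF-8 first and, only if that fails, reinterpret the bytes in a fallback encoding. The function name `to_utf8_text` and the choice of Latin-9 (ISO-8859-15) as the fallback are assumptions made for this example; the fallback is a guess about what the browser sent, not something the server can know for certain.

```python
def to_utf8_text(raw: bytes, fallback: str = "iso-8859-15") -> str:
    """Decode submitted bytes, preferring UTF-8 (hypothetical helper).

    If the bytes are not well-formed UTF-8, assume they were sent in
    `fallback` instead. That assumption may be wrong for some clients.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-15 assigns a character to every byte value,
        # so this fallback decode does not raise.
        return raw.decode(fallback)


submitted = "café".encode("iso-8859-15")  # what a Latin-9 browser might send
text = to_utf8_text(submitted)
print(text)                  # café
print(text.encode("utf-8"))  # bytes that are safe to store as UTF-8
```

Note that a byte sequence that happens to be valid UTF-8 is always kept as UTF-8, so this only repairs input that is unambiguously malformed.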

B Seven

Try adding the accept-charset attribute to your form elements.

Lars Haugseth

Place an accept-charset="UTF-8" attribute on the form element. That will cause the form post to be UTF-8 regardless of the encoding of the page content.

AnthonyWJones