Browser is HTML Encoding a character before sending it?

Question

I can't believe what I'm seeing here! I have a normal, basic HTML form (I haven't changed the enctype). If someone puts a Japanese character in the field and posts the form, my database ends up storing an HTML-encoded version of the character. I am not processing the string at all except with a Trim(). I'm using classic ASP (not by choice, I might add!). I have a feeling this has something to do with UTF-8/encoding, but I've tried messing around with the meta tag and content type and have been unable to get the character to come through properly. To make things harder, I don't seem to be able to get classic ASP debugging working in VS Express 2010. Any comments appreciated :)

I might have a feeling this might have something to do with classic ASP ;) Can you give us a demo URL or at least the first few lines in the HTML source code and the `Content-Type` HTTP header? — phihag, Sep 17 '11 at 09:47
wow 18 seconds! i had heard about this site but didnt think it would be that quick! :) Thanks for your response, im not at work at the moment so i cant, plus its an admin sys so cant provide a URL. If i did have the code here and i showed it, all you would see is the normal start of an asp page, we have done nothing to set the content type or encoding - the opening tag has nothing about encoding either, and the `<form>` tag just has a name, no enctype. Your post has made me think that i need to use something like sam spade to check out the content type and various headers tho so thanks :) — Richard, Sep 17 '11 at 09:54
Don't get too excited, that wasn't an answer yet ;) If you don't include *any* form of encoding specification, browsers will guess, and the result of that guessing is unpredictable. If possible, just add `<meta charset="utf-8">` directly after `<head>` and use UTF-8 everywhere. — phihag, Sep 17 '11 at 09:59
hmm i tried something "like" that yesterday from something i found off the web - it may not have been "exactly" that though so i will note that down and try it first thing Monday. Thanks very much, i'll post an update Monday :) — Richard, Sep 17 '11 at 10:01
Ok, update the question on Monday then. One more thing: If you want to notify me of changes, post a comment starting with `@phihag` (not necessary when you post directly under this one). Have a nice weekend. — phihag, Sep 17 '11 at 10:04
@phihag Hope you had a nice weekend too :) I have put in the HTML meta tag that you suggested, and i have also tried various combinations of Response.ContentType = "text/html; charset=utf-8" and Response.CharSet = "charset=utf-8", but alas my browser still thinks the page is ISO-8859-1 not UTF-8 :( Thanks for your suggestions though, i feel like i know what im trying to achieve now, there must be something in my IIS/ASP setup that is sending responses in that encoding and not letting me change it, so i at least have a goal i can work towards now. Thanks buddy :) — Richard, Sep 19 '11 at 08:27
@phihag - actually hold on that, this site i have inherited is using frames and actually that frame is in UTF-8 :) Ill post back later... — Richard, Sep 19 '11 at 08:32
@phihag - hmmm, well i have made my admin system use utf8 and done the same to the public facing web site which displays the text that i configure via the admin system. Suddenly i have a whole load of funny chars everywhere (i think they were probably originally all currency symbols and things like that). So now im not even sure i can go this route as we have loads of content in the live system which i have no time to update or resave as utf8. Thanks for your help on this, i think i am going to have to do something really really lame here and go back to... — Richard, Sep 19 '11 at 09:11
@phihag - ...not using server.HTMLEncode on my strings and instead writing my own encode function which replaces ampersands as a basic, and then any other chars that we get complaints about! That was how the browsers were somehow working out some of the symbols on their own. Would really like to use server.HTMLEncode on every dynamic string im outputting but doesnt seem like i will be able to even though im sure that is the right thing to do :( — Richard, Sep 19 '11 at 09:13
@phihag - gees my spelling and grammar was terrible on those posts, hope you understood what i was babbling about! — Richard, Sep 19 '11 at 09:15
I'm really sorry, I should have seen the problem right away and not bored you with my repeated questions. Answering ... — phihag, Sep 19 '11 at 17:10

score 0 · Answer 1 · edited May 23 '17 at 10:24

As you can see in this demo and read in the standard (4.10.22.6.4.2), characters in a submitted form that are not supported by the selected encoding (such as Japanese ones in an ISO-8859-* or cp1252 encoding) are replaced by the browser with numeric HTML entities, for example `&#26085;` for 日.
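The behaviour can be imitated outside a browser. The sketch below is only a rough simulation, not what any browser actually runs: the function name `simulate_form_submit` and the sample text are made up for illustration, and a real `application/x-www-form-urlencoded` submission would additionally percent-encode the resulting bytes. Python's `xmlcharrefreplace` error handler happens to do the same substitution for unencodable characters.

```python
def simulate_form_submit(text: str, page_encoding: str) -> bytes:
    """Encode form text the way a browser roughly would for a given page encoding.

    Hypothetical helper: characters the encoding cannot represent are
    replaced with decimal numeric character references (&#NNNN;).
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


sample = "日本 £5"

# On a cp1252 page the Japanese characters have no byte representation,
# so they arrive at the server as entities; the pound sign survives as 0xA3.
as_cp1252 = simulate_form_submit(sample, "cp1252")
print(as_cp1252)

# On a UTF-8 page every character is representable, so nothing is replaced.
as_utf8 = simulate_form_submit(sample, "utf-8")
print(as_utf8.decode("utf-8"))
```

With UTF-8 the original text round-trips unchanged, which is why switching the page encoding makes the entities disappear.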

If you are fine with incorrectly handling user input that contains literal HTML entities, you can replace all numeric HTML entities in the user input with the corresponding Unicode characters. However, doing so in ASP is hard, since there is no inverse function to Server.HTMLEncode and Unicode support is pretty much nonexistent in the first place.
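As a minimal sketch of that replacement step, written in Python rather than ASP: the helper name `decode_numeric_refs` and the regular expression are illustrative assumptions, not part of any library. It only touches numeric entities and leaves named ones such as `&amp;` alone.

```python
import re

# Matches decimal (&#26085;) and hexadecimal (&#x65E5;) numeric entities.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric HTML entities with the characters they stand for."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the text as submitted.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)


print(decode_numeric_refs("&#26085;&#26412;"))  # entities inserted by the browser
print(decode_numeric_refs("AT&amp;T"))          # named entity, left untouched
print(decode_numeric_refs("type &#169; here"))  # entity the user typed literally
```

The last call shows the incorrect handling mentioned above: the server cannot tell an entity the user typed on purpose from one the browser inserted, so both get decoded.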

As an alternative, use UTF-8 (and/or a web development platform from this millennium) and all these problems go away. However, since that may not be an option, you may want to unescape the HTML entities in a different program, for example with HttpUtility.HtmlDecode in C#, html_entity_decode in PHP, or HTMLParser.unescape in Python (html.unescape in Python 3).
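For the Python route, a short sketch using the standard library's `html.unescape` could look like this. The variable `stored` is a made-up stand-in for a value read back from the database. Unlike a numeric-only replacement, this call also decodes named entities such as `&amp;`.

```python
import html

# Hypothetical value as it might sit in the database after a cp1252 form post.
stored = "&#26085;&#26412; &amp; co"

# html.unescape converts numeric and named character references to Unicode.
decoded = html.unescape(stored)
print(decoded)
```

Because named entities are decoded too, this is best applied once, at a single well-defined point, so that text is not unescaped twice.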
