I can't believe what I'm seeing here! I have a normal, basic HTML form (I haven't changed the enctype). If someone puts a Japanese character in the field and posts the form, my database ends up with an HTML-encoded version of the character. I am not processing the string at all except with a Trim(). I'm using classic ASP (not out of choice, I might add!). I have a feeling this has something to do with UTF-8/encoding, but I've tried changing the meta tag and content type and have been unable to get the character to come through properly. To make things harder, I don't seem to be able to get classic ASP debugging working in VS Express 2010. Any comments appreciated :)
-
I might have a feeling this might have something to do with classic ASP ;) Can you give us a demo URL or at least the first few lines in the HTML source code and the `Content-Type` HTTP header? – phihag Sep 17 '11 at 09:47
-
Wow, 18 seconds! I had heard about this site but didn't think it would be that quick! :) Thanks for your response. I'm not at work at the moment so I can't, plus it's an admin system so I can't provide a URL. If I did have the code here and showed it, all you would see is the normal start of an ASP page; we have done nothing to set the content type or encoding - the opening tag has nothing about encoding either, and the – Richard Sep 17 '11 at 09:54
-
Don't get too excited, that wasn't an answer yet ;) If you don't include *any* form of encoding specification, browsers will guess, and the result of that guessing is unpredictable. If possible, just add `<meta charset="utf-8">` directly after `<head>` and use UTF-8 everywhere. – phihag Sep 17 '11 at 09:59
-
hmm i tried something "like" that yesterday from something i found off the web - it may not have been "exactly" that though so i will note that down and try it first thing Monday. Thanks very much, i'll post an update Monday :) – Richard Sep 17 '11 at 10:01
-
Ok, update the question on Monday then. One more thing: If you want to notify me of changes, post a comment starting with `@phihag` (not necessary when you post directly under this one). Have a nice weekend. – phihag Sep 17 '11 at 10:04
-
@phihag Hope you had a nice weekend too :) I have put in the HTML meta tag that you suggested, and I have also tried various combinations of Response.ContentType = "text/html; charset=utf-8" and Response.CharSet = "charset=utf-8", but alas my browser still thinks the page is ISO-8859-1, not UTF-8 :( Thanks for your suggestions though. I feel like I know what I'm trying to achieve now: there must be something in my IIS/ASP setup that is sending responses in that encoding and not letting me change it, so I at least have a goal I can work towards. Thanks buddy :) – Richard Sep 19 '11 at 08:27
-
@phihag - actually hold on that, this site I have inherited is using frames and actually that frame is in UTF-8 :) I'll post back later... – Richard Sep 19 '11 at 08:32
-
@phihag - hmmm, well, I have made my admin system use UTF-8 and done the same to the public-facing web site which displays the text that I configure via the admin system. Suddenly I have a whole load of funny chars everywhere (I think they were probably originally all currency symbols and things like that). So now I'm not even sure I can go this route, as we have loads of content in the live system which I have no time to update or resave as UTF-8. Thanks for your help on this, I think I am going to have to do something really really lame here and go back to... – Richard Sep 19 '11 at 09:11
-
@phihag - ...not using Server.HTMLEncode on my strings and instead writing my own encode function which replaces ampersands as a basic, and then any other chars that we get complaints about! That way the browsers were somehow working out some of the symbols on their own. I would really like to use Server.HTMLEncode on every dynamic string I'm outputting, but it doesn't seem like I will be able to, even though I'm sure that is the right thing to do :( – Richard Sep 19 '11 at 09:13
-
@phihag - gees my spelling and grammar was terrible on those posts, hope you understood what i was babbling about! – Richard Sep 19 '11 at 09:15
-
I'm really sorry, I should have seen the problem right away and not bored you with my repeated questions. Answering ... – phihag Sep 19 '11 at 17:10
1 Answer
As you can see in this demo and read in the standard (4.10.22.6.4.2), characters that are not supported by the selected encoding (such as Japanese ones in an ISO-8859-* or cp1252 encoding) are submitted as numeric HTML entities of the form `&#26085;`.
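To make that behaviour concrete, here is a minimal sketch in Python that imitates it on the server side. The function name `simulate_form_encoding` is made up for illustration, and using Python's `xmlcharrefreplace` error handler as a stand-in for the browser's form-encoding step is an assumption; it produces the same kind of decimal entity, but it is not the browser's actual code path.

```python
def simulate_form_encoding(text: str, page_encoding: str = "cp1252") -> bytes:
    """Encode text the way a form on a non-UTF-8 page roughly would.

    Characters that exist in page_encoding are encoded normally;
    anything else becomes a decimal numeric entity such as &#26085;.
    (Hypothetical helper, only an approximation of browser behaviour.)
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


# The pound sign exists in cp1252, the two Japanese characters do not.
submitted = simulate_form_encoding("price £5 日本")
print(submitted)
```

The pound sign survives as a single cp1252 byte, while each Japanese character arrives as an entity, which is what ends up in the database if nothing decodes it.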
If you are fine with incorrectly handling user input that contains HTML entities in the clear, you can replace all numeric HTML entities in the user input with the corresponding Unicode character. However, doing so in ASP is hard, since there is no inverse function to Server.HTMLEncode and Unicode support is pretty much nonexistent in the first place.
As an alternative, use UTF-8 (and/or a web development platform from this millennium) and all these problems go away. However, since that may not be an option, you may want to unescape the HTML entities in a different program, for example with HttpUtility.HtmlDecode in C#, html_entity_decode in PHP, or HTMLParser.unescape in Python (html.unescape in Python 3.4 and later).
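As a rough sketch of the "replace all numeric entities" idea, the Python below decodes only numeric references and leaves named entities such as `&amp;` alone. The helper name `decode_numeric_entities` and the regular expression are illustrative choices, not part of any library; `html.unescape` is the standard-library call and decodes named entities as well. The caveat from above still applies: a user who literally typed `&#26085;` is indistinguishable from one who typed the character itself.

```python
import html
import re

# Matches decimal (&#26085;) and hexadecimal (&#x65E5;) numeric entities.
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Replace numeric HTML entities with the characters they stand for.

    Named entities are left untouched. Hypothetical helper for illustration.
    """
    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the original text.
            return match.group(0)

    return _NUMERIC_ENTITY.sub(_replace, text)


stored = "Tokyo &#26085;&#26412; &amp; more"
print(decode_numeric_entities(stored))  # numeric entities only
print(html.unescape(stored))            # numeric and named entities
```

Decoding only the numeric form keeps any `&amp;` or `&lt;` already stored in the data intact, which may matter if the same strings are later written into HTML without re-encoding.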