Accurately unescape HTML entities in javascript

Question

In javascript, I need to take a string and HTML un-escape it.

This question over here asks the same question, and the most popular answer involves populating a temporary div.

I've used this as well, but I think I've found a bug.

Simple example, correct behavior

If you have this string: Cats&gt;Dogs

Unescaped, it should be: Cats>Dogs

Malformed example, wrong behavior

If you remove the semicolon and use this instead: Cats&gtDogs

You will get this as a result: Cats>Dogs

Isn't that wrong?

This struck me as odd. From what I understand, an escaped string requires the presence of a terminating semicolon, otherwise it's not escaped. After all, what if I had a store called guitars&amps? For all we know, this company exists but gets no business because it causes null reference exceptions everywhere it has records.

Any ideas on how I could perform unescaping while deliberately leaving a sequence alone when its semicolon is missing? Currently, all I can think to do is perform the unescaping myself.

(The WYSIWYG preview in StackOverflow exhibits a similar unusual behavior, by the way. Try entering &ampgt;, this renders as >!)

I ended up coding up a solution to this problem manually. I was able to narrow down my use case to one that only needed to positively identify simple HTML escapes. — Johnny Kauffman, Aug 30 '14 at 00:43
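
A manual, semicolon-strict unescape of the kind discussed above could look roughly like the sketch below. It is not the asker's actual code: `strictUnescape` and `NAMED_ENTITIES` are hypothetical names, and the entity table is deliberately limited to five common names as an assumption. Anything that is not a known, semicolon-terminated reference is left exactly as typed.

```javascript
// Hypothetical helper, not from the original post.
// Assumption: only these five named entities matter; extend the table as needed.
const NAMED_ENTITIES = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
};

function strictUnescape(input) {
  // The trailing ";" in the pattern is mandatory, so "&gtDogs" never matches.
  const reference = /&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][A-Za-z0-9]*));/g;
  return input.replace(reference, (match, dec, hex, name) => {
    if (name !== undefined) {
      // Unknown names such as "&bogus;" are returned unchanged.
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, name)
        ? NAMED_ENTITIES[name]
        : match;
    }
    const codePoint = dec !== undefined ? parseInt(dec, 10) : parseInt(hex, 16);
    // Reject zero, out-of-range values, and surrogate code points.
    const valid =
      codePoint > 0 &&
      codePoint <= 0x10ffff &&
      !(codePoint >= 0xd800 && codePoint <= 0xdfff);
    return valid ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(strictUnescape("Cats&gt;Dogs")); // Cats>Dogs
console.log(strictUnescape("Cats&gtDogs"));  // Cats&gtDogs
console.log(strictUnescape("guitars&amps")); // guitars&amps
```

Because the replacement runs in a single pass, a doubly escaped input such as `&amp;gt;` is decoded only once, to `&gt;`.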

score 2 · Accepted Answer · answered Nov 08 '13 at 22:45

Isn't that wrong?

Successful HTML parsers are tolerant. This is one of the things distinguishing them from, say, XML parsers. They don't necessarily stick to strict rules about markup, for the simple reason that there's a lot of incorrect markup out there. So they try to figure out what the markup is meant to represent. &gtDogs is more likely to mean >Dogs than &gtDogs, so that's what the parser goes with.

T.J. Crowder


I agree that tolerant HTML parsers have their uses. I don't mean to be rude, but that doesn't address the problem. In my situation, I'm hoping to help my users to carefully input exactly what they want. The users' input eventually is used by other systems that I can't guarantee are tolerant. In other words, I want the users to be able to see whether or not they've formatted something "properly". If I were to rely on tolerance here, there is a risk that the other systems won't read it correctly. – Johnny Kauffman Nov 20 '13 at 16:57
@JohnnyKauffman: To do that, I think you'll have to do the check yourself. It's apparent from your experiments that you can't rely on the browser, because it will try to be tolerant. The [list of named character entities](http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#Character_entity_references_in_HTML) is available and it's straightforward to validate numeric entities. Of course, that's only one small part of the problem. To do serious validation, you may want to look at integrating with [the W3C validator](http://validator.w3.org/). – T.J. Crowder Nov 20 '13 at 17:12
This seems to be the best answer. I've created strict validation code for my specific situation. Luckily, I only check on how users escape their special characters (like "&quot;"). If someone needed to validate full-blown HTML, using the W3C validation service might be the best option. – Johnny Kauffman Dec 30 '13 at 21:55
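
The do-it-yourself check suggested in these comments might be sketched as follows. This is an illustration only, not the validation code mentioned in the thread: `findBadEntities`, `isValidNumeric`, and `KNOWN_NAMES` are hypothetical names, and `KNOWN_NAMES` is a tiny sample that a real check would replace with the full list of named character entities.

```javascript
// Hypothetical validator, not from the original thread.
// Assumption: KNOWN_NAMES stands in for the complete named-entity list.
const KNOWN_NAMES = new Set(["amp", "lt", "gt", "quot", "apos"]);

// Accepts "#65" or "#x41" style bodies that name a usable code point.
function isValidNumeric(body) {
  const m = /^#(?:(\d+)|[xX]([0-9a-fA-F]+))$/.exec(body);
  if (!m) return false;
  const codePoint = m[1] !== undefined ? parseInt(m[1], 10) : parseInt(m[2], 16);
  return (
    codePoint > 0 &&
    codePoint <= 0x10ffff &&
    !(codePoint >= 0xd800 && codePoint <= 0xdfff)
  );
}

// Returns one record per ampersand that does not start a well-formed,
// semicolon-terminated reference.
function findBadEntities(input) {
  const problems = [];
  const candidate = /&(#[xX]?[0-9a-fA-F]*|[A-Za-z][A-Za-z0-9]*)?(;?)/g;
  for (const m of input.matchAll(candidate)) {
    const body = m[1] || "";
    const terminated = m[2] === ";";
    let reason = null;
    if (body === "") {
      reason = "bare ampersand";
    } else if (!terminated) {
      reason = "missing semicolon";
    } else if (body[0] === "#") {
      if (!isValidNumeric(body)) reason = "invalid numeric reference";
    } else if (!KNOWN_NAMES.has(body)) {
      reason = "unknown entity name";
    }
    if (reason) problems.push({ index: m.index, text: m[0], reason });
  }
  return problems;
}

console.log(findBadEntities("Cats&gtDogs"));
console.log(findBadEntities("Cats&gt;Dogs"));
```

Returning the offset and the offending text, rather than a plain boolean, lets a form show users exactly which sequence they need to fix.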
