Why do HTML entity names with dec < 255 not require semicolon?

Question

In a plain HTML document &pound (dec 163) renders as £ without needing the ;, whereas &oelig (dec 339) will only render as œ with the semicolon. It seems that every HTML entity with a decimal value under 255 will render without needing the semicolon, both in Firefox and Chrome.

What gives?

Jukka K. Korpela · Accepted Answer · 2013-09-09T11:11:34.180

The reason is that historically the semicolon has been optional when an entity reference (or a character reference) is not immediately followed by a name character. So &pound? is OK since ? is not a name character (i.e., a character allowed in names), but &pound4 is not, since 4 is a name character, making pound4 the entity name (which is undefined in HTML, but might become defined some day). This rule is part of SGML legacy in HTML, one of the few things where browsers actually applied specialties of SGML.
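The name-character rule can be sketched in Python as follows. This is only an illustration: the helper `legacy_reference_names` is hypothetical, and the set of name characters used here (ASCII letters, digits, `.` and `-`) is an assumption standing in for whatever the SGML declaration in force defines.

```python
import re

# Assumed name characters: ASCII letters, digits, "." and "-".
# A reference name starts with a letter and extends as far as name characters go,
# so the scanner never looks at a table of known entities to decide where to stop.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*)")

def legacy_reference_names(text):
    """Return the entity names an SGML-style scanner would see in text."""
    return REFERENCE.findall(text)

print(legacy_reference_names("&pound?"))   # ['pound']
print(legacy_reference_names("&pound4"))   # ['pound4']
print(legacy_reference_names("&pound;4"))  # ['pound']
```

The second call shows why the semicolon matters under this rule: without it, the digit is absorbed into the name.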

It has, however, always been regarded as good practice to terminate entity references with a semicolon. XML, and hence XHTML, makes it even formally mandatory.

This is why current browser practice allows omitting the semicolon as in “classic” HTML, but only for the limited set of character references denoting ISO Latin 1 characters, i.e. characters with Unicode code points up to 255 in decimal (FF in hexadecimal). This was the original set of entity references, and therefore such references have been widely used without a semicolon. So the practice is a compromise: browsers encourage the recommended notation without invalidating the bulk of old pages, still less failing to render them properly.
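One way to look at this compromise outside a browser is Python's standard library, whose `html.unescape` is documented as following the HTML 5 rules and whose `html.entities.html5` table lists each name that may appear without a semicolon as a separate key. The snippet below is a sketch based on that table, not a statement about any particular browser.

```python
import html
from html.entities import html5

# Keys without a trailing ";" are the names that are recognized unterminated.
legacy = {name: value for name, value in html5.items() if not name.endswith(";")}

print("pound" in legacy, "oelig" in legacy)  # True False

# Highest code point among the names that work without a semicolon.
print(max(ord(ch) for value in legacy.values() for ch in value))

print(html.unescape("&pound 5"))      # £ 5
print(html.unescape("&oelig"))        # &oelig  (left alone without the semicolon)
print(html.unescape("&oelig;"))       # œ
print(html.unescape("foo&poundbar"))  # foo£bar
```

The last line differs from the strict name-character rule: here the longest known name is matched, so `&poundbar` is read as `&pound` followed by `bar`.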

The HTML5 drafts have had various positions on this, but e.g. HTML5 CR from 6 August 2013 requires the semicolon in all cases, even in HTML syntax. Lack of a semicolon is defined as a parse error, which means that error handling is well-defined (the entity shall be recognized), but browsers may still stop parsing at the first parse error!

Do you have a reference for the Latin 1 special-case being "the current rules"? Both the WHATWG standard and the W3C HTML5 draft seem to say the semi-colon is mandatory, as quoted in my answer. — IMSoP, Sep 09 '13 at 08:42
@IMSoP, good catch. I’ve edited my answer accordingly. What I describe is common practice in modern browsers, was the text in some earlier HTML5 draft, and is reflected in http://validator.w3.org (which reports both `&pound` and `&oelig` as errors, but differently: in the former case, it’s an error in the syntax of the reference, in the latter case, the reference is reported as not recognized). — Jukka K. Korpela, Sep 09 '13 at 11:15
Aha! That's the problem with these "Living" and "Draft" standards, I guess, you have to check the text hasn't changed since you last read it. It would certainly explain why some non-semicolon forms are listed in [this table](http://www.w3.org/html/wg/drafts/html/CR/syntax.html#named-character-references), and then declared invalid [elsewhere in the standard](http://www.w3.org/html/wg/drafts/html/CR/syntax.html#character-references). For reference, the parsing rules [are defined here in the W3C draft](http://www.w3.org/TR/html5/syntax.html#consume-a-character-reference). — IMSoP, Sep 09 '13 at 11:53

IMSoP · Answer 2 · 2013-09-08T23:20:43.727

Firstly, this is entirely up to how forgiving the browser/rendering engine wants to be, and is not a property of HTML: all entities must end in a semi-colon, or you have invalid syntax. (The WHATWG "HTML Living Standard" confusingly considers this semi-colon to be part of the name, making it seem optional in the Developer Edition, but the full Standard text/W3C HTML5 draft is clearer: "The name must be one that is terminated by a U+003B SEMICOLON character (;).")

Secondly, referring to a character as having a "decimal value" is ambiguous at best. 163 and 339 are the "code points" of those characters in Unicode, which would normally be expressed in hexadecimal. Other encodings would have different positions for those characters, which could also be expressed as a "decimal value" if you wanted.
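As a quick illustration in Python (the choice of code pages here is mine, purely as examples), the same two characters have one Unicode code point each but land on different byte values depending on the encoding:

```python
# Unicode code points, shown in decimal, hex, and the usual U+ notation.
for ch in "£œ":
    print(ch, ord(ch), hex(ord(ch)), f"U+{ord(ch):04X}")
# £ 163 0xa3 U+00A3
# œ 339 0x153 U+0153

# Positions in some legacy single-byte encodings.
print("£".encode("latin-1"))  # b'\xa3'
print("£".encode("cp437"))    # b'\x9c'
print("œ".encode("cp1252"))   # b'\x9c'
```

So "163" only identifies £ once you say which character set you are counting in; in code page 437 the same character sits at 156, which is where Windows-1252 puts œ.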

Thirdly, my guess is that it is not so much to do with where they come in a particular encoding sequence, but how common they are - the full list is extremely long (see the WHATWG and W3C lists). There is a trade-off to be made in interpreting such invalid sequences, since a URL might contain unescaped ampersands, which then in turn look like unterminated entities (e.g. http://example.com/foo?bar=rab&oelig=gileo). So browsers are trying to tread that fine line and guess which mistake was probably made in a particular case.
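The URL trade-off can be seen with Python's `html.unescape`, used here only as a stand-in for a lenient HTML 5 style decoder. The second URL and its `pound` parameter are made up for the example.

```python
import html

# "&oelig" has no semicolon-less form, so the query string survives intact.
url = "http://example.com/foo?bar=rab&oelig=gileo"
print(html.unescape(url) == url)  # True

# A hypothetical parameter that happens to share a name with a legacy entity
# is not so lucky when the text is decoded as ordinary content.
risky = "http://example.com/foo?price=5&pound=yes"
print(html.unescape(risky))  # http://example.com/foo?price=5£=yes
```

As far as I can tell, the HTML5 parsing rules carve out an exception for exactly this situation inside attribute values (an unterminated reference followed by `=` or an alphanumeric is left alone there), which `html.unescape` cannot apply because it does not know the context of the string it is given.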

The HTML 4.01 specification, in section [Character references](http://www.w3.org/TR/REC-html40/charset.html#h-5.3), says: “In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag)”. And HTML 4.01 normatively cites the SGML standard. — Jukka K. Korpela, Sep 09 '13 at 06:26
Good point. Browsers are more forgiving than SGML, though (and probably always have been); for instance Firefox treats `foo&poundbar` as containing the £ entity. That could be according to a rule in the WHATWG/HTML5 standard, but I couldn't find such a rule. — IMSoP, Sep 09 '13 at 08:46
