What are the longest and shortest HTML character entity names?

Question

There are a million cheatsheets all around the tubes that enumerate, with varying levels of comprehensiveness, the character entities specified by various versions and specifications of HTML. I don't want to trust any particular one of them, so I figure I'll toss it out here and see if anyone posts a more authoritative answer.

So, let's assume that I want to match any and all character references and entities using a regular expression. I'd start with /&(?:#(?:x[0-9a-f]+|[0-9]+)|[a-z]{???,???});/i. But what would go into the ???s? I can think of entities that are two characters long, like lt and gt, but are there any one-letter entities in any HTML specification? Likewise, what is the longest entity? Finally, those are the only three syntaxes for expressing literal characters in HTML aside from just typing them directly, are they not?

Why do you need to specify the length anyway? A simple `+` should do, no? – deceze, Sep 24 '12 at 13:32
Not really... &laksjdlfkjasdlkfjadslkfjasdlkfjasldfkj; will just be rendered verbatim, and is therefore not an entity. – wwaawaw, Sep 24 '12 at 13:36
So will `&foo;` because it's not a defined entity. It's not about the length. – deceze, Sep 24 '12 at 13:38
Good question (but I don't know the answer). Note however, that alpha entities are case sensitive (e.g. `&Dagger;` and `&dagger;`), so you'll need to include the uppercase chars in your alpha char class alternative. – ridgerunner, Sep 24 '12 at 13:45
Right, but if I want to match the most narrow and valid set possible without resorting to enumeration then length is important. – wwaawaw, Sep 24 '12 at 13:46
Missed that. _D'oh!_ But the regex may run a smidge faster if you remove the `i` modifier and explicitly specify the uppercase chars in the char classes. – ridgerunner, Sep 24 '12 at 15:05
Should not have been closed. This is an excellent question, and did indeed help future visitors! – Charles Roth, May 07 '18 at 20:15

score 6 · Accepted Answer · answered Sep 24 '12 at 13:44

Longest in HTML5 is &CounterClockwiseContourIntegral;, and there are no one-letter names.

But note that named entity references don't work as you think. Some named character references don't end with a semi-colon, so a regex won't cut the mustard.
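One way to check these claims without trusting a cheatsheet is to inspect the table that ships with Python's standard library. This is only a minimal sketch, assuming the `html.entities.html5` dictionary (whose keys are reference names without the leading `&`) mirrors the HTML5 list; the variable names are illustrative.

```python
from html.entities import html5

# Keys look like "lt;" or "copy": most end in ";", but the legacy
# references are also listed a second time without the semicolon.
names = list(html5)

# Longest key, including its trailing ";" if it has one.
longest = max(names, key=len)

# All keys that share the minimum length.
shortest_len = min(len(name) for name in names)
shortest = sorted(name for name in names if len(name) == shortest_len)

# References that are recognised without a terminating semicolon.
no_semicolon = sorted(name for name in names if not name.endswith(";"))

print(longest, len(longest))
print(shortest_len, shortest)
print(len(no_semicolon), no_semicolon[:5])
```

Because the semicolon-free names are ordinary keys in the same table, a pattern that requires a trailing `;` will miss them.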

answered Sep 24 '12 at 13:44

Alohci

Interesting, I wasn't aware of non-semicolon-terminated entities. Do you have an example/reference? – deceze Sep 24 '12 at 13:47
Can you provide an example of a non-semicolon ending one? – wwaawaw Sep 24 '12 at 13:48
Out of curiosity, can you add examples and/or links? (And what does any of this have to do with mustard? :-) – tripleee Sep 24 '12 at 13:50
`&copy` is the most common. `&shy` is another. There are over one hundred of them. The W3C HTML5 list seems broken at the minute but they should be available on the WHATWG copy. – Alohci Sep 24 '12 at 13:53
List is here: http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references but the non-semi colon ones are mixed in with the ones that do have semi colons. – Alohci Sep 24 '12 at 13:55
Oops, sorry. Your comments hadn't been showing up when I added mine. Anyway, why wouldn't they make it predictable and give them all `;`s? – wwaawaw Sep 24 '12 at 14:25
Browsers have always tried to fix broken markup, and at least one browser - probably Netscape when it had the majority market share - decided that if authors forgot the semi-colon then they'd just fix it for them. Once that happened, web pages came to rely on the behaviour and other browsers had to follow suit, otherwise the pages would look broken in their browsers. HTML5 just documents what has been long-standing browser practice. – Alohci Sep 24 '12 at 14:30

score 3 · Answer 2 · answered Sep 24 '12 at 13:43

The HTML5 spec now explicitly describes what browsers have done as error correction since the mid-90s: show the text verbatim if it doesn't match a known character reference. Therefore, if you want your regex to work like a browser, you have to copy the browser's behaviour.

That means you have to test against a complete list of known references, like the one mentioned by Jukka. You can abbreviate the pattern with clever use of character classes and grouping, for example

[aeiou]uml

but you need to bake the same knowledge that the browser has into the regex in order to get the same result.
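As a rough illustration of testing against a complete list, the sketch below uses a deliberately loose regex to find candidates and then looks each one up in a table. It assumes Python's `html.entities.html5` dictionary as that table; `CANDIDATE` and `find_named_references` are hypothetical names, and the longest-prefix fallback only approximates how browsers treat text content (attribute values follow slightly different rules).

```python
import re
from html.entities import html5

# Loose candidate: "&", a letter, up to 30 more letters or digits, optional ";".
# The bound only limits how much text is captured; the lookup decides validity.
CANDIDATE = re.compile(r"&([A-Za-z][A-Za-z0-9]{1,30};?)")


def find_named_references(text):
    """Return the named character references in text that the table knows."""
    found = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group(1)
        if candidate in html5:
            found.append("&" + candidate)
            continue
        # Fall back to the longest known prefix, which covers the legacy
        # names that are recognised without a terminating semicolon.
        for end in range(len(candidate) - 1, 1, -1):
            if candidate[:end] in html5:
                found.append("&" + candidate[:end])
                break
    return found


print(find_named_references("a &lt; b &copy 2012 &foo; &notit;"))
```

Python's own `html.unescape` performs a similar table lookup, so it may be a better starting point than a hand-written pattern when the goal is decoding rather than matching.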

Edit: By the way, named entities might also have digits in them, e.g., &emsp13;.
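To see which characters a name pattern would have to allow, the names can be scanned directly. This is a small sketch, again assuming `html.entities.html5` as the reference table; the variable names are made up for the example.

```python
from html.entities import html5

# Strip the optional trailing ";" so each name is counted once.
bare_names = {name.rstrip(";") for name in html5}

# Names that contain at least one digit.
with_digits = sorted(n for n in bare_names if any(c.isdigit() for c in n))

# Checks on the alphabet a character class would need to cover.
only_ascii_alnum = all(n.isascii() and n.isalnum() for n in bare_names)
starts_with_letter = all(n[0].isalpha() for n in bare_names)

print(with_digits[:10])
print(only_ascii_alnum, starts_with_letter)
```

If both checks hold for the table in use, a class such as `[A-Za-z][A-Za-z0-9]*` covers the alphabet, although it still says nothing about which names are actually defined.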

score 2 · Answer 3 · answered Sep 24 '12 at 13:38

Entity names used to have 2 to 8 characters, following SGML tradition, and this is still the case in the HTML 4.01 specification (and XHTML specifications). But HTML5 drafts add a large number of entities, called named character references there, and some of them are fairly long, like EmptyVerySmallSquare. So it would be better to avoid any fixed upper limit – or a lower limit larger than 1.
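The difference between the two generations of the specification can be measured rather than remembered. The following is a sketch under the assumption that Python's `html.entities.name2codepoint` reflects the HTML 4.01 entity set and `html.entities.html5` the HTML5 one; `length_range` is a hypothetical helper.

```python
from html.entities import html5, name2codepoint


def length_range(names):
    """Return (shortest, longest) name length, ignoring any trailing ";"."""
    lengths = [len(name.rstrip(";")) for name in names]
    return min(lengths), max(lengths)


html4_range = length_range(name2codepoint)
html5_range = length_range(html5)

print("HTML 4.01:", html4_range)
print("HTML5:    ", html5_range)
```

Any bounds derived this way describe only the tables bundled with one Python version, which is one more reason not to hard-code them into a pattern.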

answered Sep 24 '12 at 13:38

Jukka K. Korpela


Why not a lower bound of `2`? – wwaawaw Sep 24 '12 at 13:47
Because some day someone may add a single-letter entity (at least as browser-specific). And `&a;` is an entity reference by current HTML specifications – just an undefined one. – Jukka K. Korpela Sep 24 '12 at 14:51
