2

There are a million cheatsheets all around the tubes that enumerate, with varying degrees of completeness, the character entities specified by various versions and specifications of HTML. I don't want to trust any particular one of them, so I figure I'll toss it out here and see if anyone posts a more authoritative answer.

So, let's assume that I want to match any and all character references and entities using a regular expression. I'd start with `/&(?:#(?:x[0-9a-f]+|[0-9]+)|[a-z]{???,???});/i`. But what would go into the `???`s? I can think of entities that are two characters long, like `lt` and `gt`, but are there any one-letter entities in any HTML specification? Likewise, what is the longest entity? Finally, those are the only three syntaxes for expressing literal characters in HTML aside from just typing them directly, are they not?

Wai Ha Lee
wwaawaw
  • 3
    Why do you need to specify the length anyway? A simple `+` should do, no? – deceze Sep 24 '12 at 13:32
  • 1
    Not really... &laksjdlfkjasdlkfjadslkfjasdlkfjasldfkj; will just be rendered verbatim, and is therefore not an entity. – wwaawaw Sep 24 '12 at 13:36
  • 3
    So will `&foo;` because it's not a defined entity. It's not about the length. – deceze Sep 24 '12 at 13:38
  • 3
    Good question (but I don't know the answer). Note, however, that alpha entities are case sensitive (e.g. `&Dagger;` and `&dagger;`), so you'll need to include the uppercase chars in your alpha char class alternative. – ridgerunner Sep 24 '12 at 13:45
  • Right, but if I want to match the most narrow and valid set possible without resorting to enumeration then length is important. – wwaawaw Sep 24 '12 at 13:46
  • 1
    @ridgerunner nope, there's an `/i` flag. – wwaawaw Sep 24 '12 at 13:51
  • Missed that. _D'oh!_ But the regex may run a smidge faster if you remove the `i` modifier and explicitly specify the uppercase chars in the char classes. – ridgerunner Sep 24 '12 at 15:05
  • 2
    Should not have been closed. This is an excellent question, and did indeed help future visitors! – Charles Roth May 07 '18 at 20:15

3 Answers

6

Longest in HTML5 is &CounterClockwiseContourIntegral;, and there are no one-letter names.

But note that named entity references don't work the way you think. Some named character references don't end with a semicolon, so a regex won't cut the mustard.
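
Both length claims can be checked against a machine-readable copy of the HTML5 table. Here is a minimal sketch in Python, assuming the standard library's `html.entities.html5` dictionary is a faithful copy of that table (its keys are names such as `lt;`, with the semicolon-less variants listed as separate keys):

```python
from html.entities import html5

# Strip the trailing ";" so that "lt;" and "lt" count as the same name.
names = {key.rstrip(";") for key in html5}

longest = max(names, key=len)
shortest_length = min(len(name) for name in names)

print(longest, len(longest))   # longest name and its length
print(shortest_length)         # length of the shortest name
```

Under that assumption the bounds come out as 2 and 31 characters, but a length range alone still accepts undefined names such as `&foo;`, so it narrows the match without making it exact.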

Alohci
  • 1
    Interesting, I wasn't aware of non-semicolon-terminated entities. Do you have an example/reference? – deceze Sep 24 '12 at 13:47
  • 1
    Can you provide an example of a non-semicolon ending one? – wwaawaw Sep 24 '12 at 13:48
  • Out of curiosity, can you add examples and/or links? (And what does any of this have to do with mustard? :-) – tripleee Sep 24 '12 at 13:50
  • 1
    `&copy` is the most common. `&shy` is another. There are over one hundred of them. The W3C HTML5 list seems broken at the minute but they should be available on the WHATWG copy. – Alohci Sep 24 '12 at 13:53
  • List is here: http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references but the ones without semicolons are mixed in with the ones that do have them. – Alohci Sep 24 '12 at 13:55
  • Oops, sorry. Your comments hadn't been showing up when I added mine. Anyway, why wouldn't they make it predictable and give them all `;`s? – wwaawaw Sep 24 '12 at 14:25
  • 1
    Browsers have always tried to fix broken markup, and at least one browser (probably Netscape, when it had the majority market share) decided that if authors forgot the semicolon, then they'd just fix it for them. Once that happened, web pages came to rely on the behaviour and other browsers had to follow suit, otherwise the pages would look broken in their browsers. HTML5 just documents what has been long-standing browser practice. – Alohci Sep 24 '12 at 14:30
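
To see which names are affected, one can filter a copy of the table for keys that lack the trailing semicolon. This is a sketch, assuming Python's `html.entities.html5` mirrors the HTML5 list and that `html.unescape` applies the HTML5 decoding rules, as its documentation states:

```python
from html import unescape
from html.entities import html5

# Keys without a trailing ";" are the names also recognised when unterminated.
legacy = sorted(key for key in html5 if not key.endswith(";"))

print(len(legacy))
print("copy" in legacy, "shy" in legacy)

# The decoder accepts the unterminated form but leaves unknown names alone.
print(unescape("&copy 2012, &foo;"))
```

The last line shows the behaviour described above: `&copy` is decoded even without its semicolon, while `&foo;` is passed through verbatim.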
3

The HTML5 spec now explicitly describes what browsers have done as error correction since the mid-90s: show the text verbatim if it doesn't match a known character reference. Therefore, if you want your regex to work like a browser, you have to copy the browsers' behaviour.

That means you have to test against a complete list of known references, like the one mentioned by Jukka. You can abbreviate the pattern with clever use of character classes and grouping,

[aeiou]uml

but you need to bake the same knowledge into the regex that the browser has in order to get the same result.
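
One way to bake that knowledge in is to generate the alternation from a table rather than write it by hand. The following is only a sketch: it assumes Python's `html.entities.html5` dictionary as the list of known names, `CHAR_REF` is a name made up for this example, and the numeric branch simply follows the pattern from the question (semicolon required).

```python
import re
from html.entities import html5

# Longest names first, so "&notin;" is tried before the unterminated "&not".
names = sorted(html5, key=len, reverse=True)
named = "|".join(re.escape(name) for name in names)

# No case-insensitive flag: names are case sensitive (e.g. "Dagger;" vs "dagger;").
CHAR_REF = re.compile(r"&(?:#(?:[xX][0-9a-fA-F]+|[0-9]+);|(?:" + named + r"))")

sample = "a &lt; b &copy 2012 &foo; &#x41; &#65; &notin;"
print(CHAR_REF.findall(sample))
```

The generated pattern is large (one alternative per table key) and it ignores context-dependent rules such as how unterminated references are treated inside attribute values, so treat it as an approximation of browser behaviour rather than a reimplementation.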

Edit: By the way, named entities might also have digits in them, e.g., `&emsp13;`.
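
The names that contain digits can be listed the same way. A short sketch, again assuming Python's `html.entities.html5` dictionary as the source of names:

```python
from html.entities import html5

# Collect every table key that contains at least one digit.
with_digits = sorted(key for key in html5 if any(ch.isdigit() for ch in key))

print(len(with_digits))
print(with_digits[:10])
```

Names such as `frac12;` show up in this list, which is why a letters-only class like `[a-z]` is too narrow for the name part.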

Boldewyn
2

Entity names used to be short, 2 to 8 characters (from `lt` to `thetasym`), following SGML tradition, and this is still the case in the HTML 4.01 specification (and XHTML specifications). But HTML5 drafts add a large number of entities, called named character references there, and some of them are fairly long, like `EmptyVerySmallSquare`. So it would be better to avoid any fixed upper limit, or a lower limit larger than 1.
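
A pattern without length limits can still be paired with a table lookup to reject undefined names. The sketch below assumes Python's `html.entities.html5` dictionary as that table; `CANDIDATE` and `named_references` are hypothetical names for this example, and only semicolon-terminated references are considered.

```python
import re
from html.entities import html5

# Loose syntactic check only: "&", a letter, then letters or digits, then ";".
CANDIDATE = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")

def named_references(text):
    """Return the semicolon-terminated candidates whose name is in the table."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if m.group(1) in html5]

sample = "&lt; &foo; &EmptyVerySmallSquare; &laksjdlfkjasdlkfjadslkfjasdlkfjasldfkj;"
print(named_references(sample))
```

Keeping the regex loose and doing the validation in code means the pattern does not need to change when a specification adds longer or shorter names.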

Jukka K. Korpela