
I'm doing some HTML stripping using regular expressions (yes, I know, never parse HTML with regexes, but I'm just stripping it, and I also unfortunately cannot use any external libraries). I'm using a regex from the Regular Expressions Cookbook, and it has worked great, except I just ran into this problem:

In the string `Bob Saget <bobs@aol.com>`, my regex is matching the email address as a tag.

So my question is: is the @ sign a valid character in an XML or HTML tag name? (I'm not asking whether it is valid within an attribute; I know that it is.) If it is not, I can safely exclude it in my regex.

I'm not sure where to look this up. I looked here and I think that says that in XML, the at-sign is not allowed in a tag; however, I would appreciate some concrete proof.

BoltClock
NickAldwin
  • The problem is rather your naked angled brackets, which should be given by entity or character references. The '@' is a perfectly valid character in any flavour of HTML. – Kerrek SB Aug 15 '11 at 13:50
  • @Kerrek Of course the `@` sign is a valid character. But is it valid in a tag? If I were to give a HTML or XML parser a tag with at signs in it, would it parse it? – NickAldwin Aug 15 '11 at 14:00
  • By "valid tag" do you mean "valid element type name"? The answer is "no", [see here](http://www.w3.org/TR/REC-xml/#NT-NameChar) for a list of valid characters. The element type name must be a `Name`. Quote: "The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names [...]" ... ah, you already found that. – Kerrek SB Aug 15 '11 at 14:16
  • @NickAldwin - the `NameChar` specification is a formal grammar. Anything that's not explicitly included is excluded. Your edit should be moved to an answer. – parsifal Aug 15 '11 at 14:17
  • firefox seems to support it, but only a few people create custom elements and I don't think they would ever use @ in the tagname. Don't strip them, and encode the <, >, & – Gerben Aug 15 '11 at 14:23

1 Answer


After another look at the XML Specification:

A start-tag consists of:

'<' Name (S Attribute)* S? '>'

A Name consists of:

NameStartChar (NameChar)*

A NameStartChar consists of:

":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

A NameChar consists of:

NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

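The @ sign is U+0040, which falls between the digits [0-9] (U+0030 to U+0039) and the uppercase letters [A-Z] (U+0041 to U+005A) and is not listed in any of the other ranges.

So the @ sign is not valid as a NameChar or a NameStartChar, and thus not valid in a Name.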
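For anyone who wants to check this mechanically, the productions quoted above can be turned into a character class. The following is a minimal Python sketch, not part of the XML specification: the names `NAME_START`, `NAME_CHAR`, `XML_NAME`, and `is_xml_name` are made up for illustration, and the ranges are transcribed by hand from the lists above.

```python
import re

# NameStartChar ranges, transcribed from the production quoted above.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)

# NameChar adds "-", ".", digits, U+00B7 and two more ranges.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Name ::= NameStartChar (NameChar)*
XML_NAME = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_xml_name(candidate: str) -> bool:
    """Return True if the whole string matches the Name production."""
    return XML_NAME.fullmatch(candidate) is not None


print(is_xml_name("div"))           # an ordinary element name
print(is_xml_name("bobs@aol.com"))  # contains U+0040
```

Under these assumptions, `bobs@aol.com` is rejected because `@` appears in neither class, while ordinary names such as `div` or `xs:element` are accepted.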

NickAldwin
  • It's not clear whether this is applicable to HTML, which your original question was focused on. – BoltClock Mar 03 '13 at 11:46
  • The WhatWG HTML specification only allows `[a-zA-Z]` as valid `NameStartChar`. For `NameChar`, it allows `[^\s\0>/]`. – Azmisov Jan 12 '14 at 00:32
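
To see what the last comment implies for the original example, here is a small Python sketch that takes the two character classes from that comment at face value. It approximates the comment's description and is not text copied from the WHATWG specification; `HTML_TAG_NAME` and `is_html_tag_name` are hypothetical names.

```python
import re

# Tag-name rule as described in the comment above: an ASCII letter first,
# then any characters other than whitespace, NUL, ">" or "/".
HTML_TAG_NAME = re.compile(r"[A-Za-z][^\s\x00>/]*")


def is_html_tag_name(candidate: str) -> bool:
    """Return True if the whole string fits the comment's tag-name rule."""
    return HTML_TAG_NAME.fullmatch(candidate) is not None


print(is_html_tag_name("bobs@aol.com"))  # starts with a letter, no excluded chars
print(is_html_tag_name("@foo"))          # does not start with an ASCII letter
```

If that description is accurate, `bobs@aol.com` does fit the HTML tag-name rule even though it is not an XML `Name`, so excluding `@` is justified by the XML grammar rather than by HTML parsing behaviour.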