In this answer I'm going to demonstrate that tags can be matched with a regex, contrary to the belief expressed in some comments that a tag can only be identified with a complete HTML/XML parser.
For the demonstration I shall use the subset of the XML grammar rules from the XML 1.1 specification published by the W3C (www.w3.org), namely all the rules reachable from STag and EmptyElemTag, which are the tags we want to match. As none of these rules is recursive, this set of rules can be converted to a regexp that parses start tags and empty-element tags.
As XML documents are Unicode text and XML allows characters beyond the range \u0000-\uffff, I need a notation for character classes that covers those code points. I shall use a nonstandard extension of the \u notation, with five hex digits instead of four, to simplify the grammar-to-regexp conversion (it covers the allowed characters in the range 0x10000-0xeffff).
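The five-digit \u notation is not understood by real regex engines, so it has to be translated before use. As a minimal sketch, assuming Python's `re` module (which accepts eight-digit `\UXXXXXXXX` escapes in patterns), the hypothetical helper `to_python_regex` below rewrites the notation. It assumes every `\u` in the input is followed by exactly five hex digits, as in this answer.

```python
import re

def to_python_regex(pattern: str) -> str:
    # Rewrite each five-hex-digit \uXXXXX escape (the nonstandard notation
    # used in this answer) as Python's eight-digit \U000XXXXX escape.
    return re.sub(
        r"\\u([0-9a-fA-F]{5})",
        lambda m: "\\U000" + m.group(1),
        pattern,
    )

# A fragment of the NameStartChar class, written in the five-digit notation.
fragment = r"[\u0f900-\u0fdcf\u10000-\ueffff]"
converted = to_python_regex(fragment)

# The converted class can be compiled and covers supplementary characters.
supplementary = re.compile(converted)
print(converted)
print(bool(supplementary.fullmatch("\U00010400")))
```

Other engines use different escapes for code points above 0xffff (for example `\x{10000}` or `\u{10000}`), so the replacement string would need to change accordingly.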
Borrowed from the XML 1.1 specification, this is the syntax for the start and empty-element tags:
STag ::= '<' Name (S Attribute)* S? '>'
EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'
Name ::= (NameStartChar NameChar*)
NameChar ::= (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f-\u02040])
NameStartChar ::= ([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])
S ::= ([\u00020\u00009\u0000d\u0000a]+)
Attribute ::= (Name Eq AttValue)
Eq ::= (S? '=' S?)
AttValue ::= ( '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" )
Reference ::= (EntityRef | CharRef)
EntityRef ::= ('&' Name ';')
CharRef ::= ('&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')
To construct the regular expression that accepts start tags and empty tags, I begin with the above grammar and add a simple start rule that accepts either a start tag or an empty tag:
Start ::= STag | EmptyElemTag
then substitute each nonterminal with the (properly parenthesized) right side of its rule, until only terminal elements and regexp operators remain on the right side:
Start ::= '<' Name (S Attribute)* S? '>' | '<' Name (S Attribute)* S? '/>'
Both alternatives share the same prefix and differ only in the optional '/', so I can group terms and get
Start ::= '<' Name (S Attribute)* S? '/'?'>'
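The grouping step relies on the identity `A B | A C = A (B | C)`, with `'>' | '/>'` collapsing to `'/'? '>'`. The following is only a sanity check on a handful of strings, not a proof, and it uses simplified ASCII-only stand-ins for Name, S and Attribute (assumptions made to keep the example short; the real rules are the ones in the grammar above).

```python
import re

# Simplified ASCII-only stand-ins (assumptions for this sketch only).
NAME = r"[A-Za-z_:][-A-Za-z0-9_:.]*"
S = r"[ \t\r\n]+"
ATTR = NAME + r'="[^<&"]*"'
BODY = f"{NAME}(?:{S}{ATTR})*(?:{S})?"

# The rule before grouping: two full alternatives.
unfactored = re.compile(f"<{BODY}>|<{BODY}/>")
# The rule after grouping: shared prefix, optional '/'.
factored = re.compile(f"<{BODY}/?>")

samples = ["<a>", "<a/>", '<a b="c">', '<a b="c" />', "<a", "</a>", "<a//>"]
agree = all(
    bool(unfactored.fullmatch(text)) == bool(factored.fullmatch(text))
    for text in samples
)
print(agree)
```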
Now substitute Attribute
Start ::= '<' Name (S Name Eq AttValue)* S? '/'? '>'
Now substitute AttValue
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'" ))* S? '/'? '>'
Now substitute Reference
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | EntityRef | CharRef)* '"' | "'" ([^<&'] | EntityRef | CharRef)* "'" ))* S? '/'? '>'
Now substitute EntityRef
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | CharRef)* '"' | "'" ([^<&'] | '&' Name ';' | CharRef)* "'" ))* S? '/'? '>'
Now substitute CharRef
Start ::= '<' Name (S Name Eq ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'
Now Eq
Start ::= '<' Name (S Name S? '=' S? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* S? '/'? '>'
Next S
Start ::= '<' Name (([\u00020\u00009\u0000d\u0000a]+) Name ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' Name ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Now substitute Name
Start ::= '<' (NameStartChar NameChar*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar NameChar*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar NameChar*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Now substitute NameChar
Start ::= '<' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? ('"' ([^<&"] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (NameStartChar (NameStartChar | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
And last NameStartChar
Start ::= '<' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) (([\u00020\u00009\u0000d\u0000a]+) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ([\u00020\u00009\u0000d\u0000a]+)? '=' ([\u00020\u00009\u0000d\u0000a]+)? 
('"' ([^<&"] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* '"' | "'" ([^<&'] | '&' (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) (([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff]) | [-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*) ';' | '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';')* "'" ))* ([\u00020\u00009\u0000d\u0000a]+)? '/'? '>'
Finally, after replacing each quoted literal 'c' with the bare character c and eliminating the unwanted blank spaces, the regex becomes:
<(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)(([\u00020\u00009\u0000d\u0000a]+)(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*)([\u00020\u00009\u0000d\u0000a]+)?=([\u00020\u00009\u0000d\u0000a]+)?(\"([^<&\"]|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\"|\'([^<&\']|&(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])(([:A-Za-z_\u000c0-\u000d6\u000d8-\u000f6\u000f8-\u002ff\u00370-\u0037d\u0037f-\u01fff\u0200c-\u0200d\u02070-\u0218f\u02c00-\u02fef\u03001-\u0d7ff\u0f900-\u0fdcf\u0fdf0-\u0fffd\u10000-\ueffff])|[-.0-9\u000b7\u00300-\u0036f\u0203f\u0203f\u02040])*);|&#[0-9]+;|&#x[0-9a-fA-F]+;)*\'))*([\u00020\u00009\u0000d\u0000a]+)?/?>
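The substitution carried out by hand above can also be done mechanically, which avoids copy-and-paste slips in such a long expression. The following is a minimal sketch, assuming Python's `re` module. The names `RULES`, `PLACEHOLDER` and `expand`, and the convention of writing nonterminals as `{Name}`, are choices made for this illustration and are not part of the XML specification. The rules are written directly in Python regex syntax, with `\U000XXXXX` in place of the five-digit notation.

```python
import re

# Grammar rules in Python regex syntax; {X} marks a nonterminal to expand.
RULES = {
    "Start": r"<{Name}(?:{S}{Attribute})*{S}?/?>",
    "Attribute": r"{Name}{Eq}{AttValue}",
    "Eq": r"{S}?={S}?",
    "AttValue": r"""(?:"(?:[^<&"]|{Reference})*"|'(?:[^<&']|{Reference})*')""",
    "Reference": r"(?:{EntityRef}|{CharRef})",
    "EntityRef": r"&{Name};",
    "CharRef": r"(?:&#[0-9]+;|&#x[0-9a-fA-F]+;)",
    "Name": r"{NameStartChar}{NameChar}*",
    "NameChar": r"(?:{NameStartChar}|[-.0-9\u00b7\u0300-\u036f\u203f-\u2040])",
    "NameStartChar": (
        r"[:A-Za-z_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02ff\u0370-\u037d"
        r"\u037f-\u1fff\u200c-\u200d\u2070-\u218f\u2c00-\u2fef\u3001-\ud7ff"
        r"\uf900-\ufdcf\ufdf0-\ufffd\U00010000-\U000effff]"
    ),
    "S": r"[ \t\r\n]+",
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def expand(rule_name: str) -> str:
    # Replace nonterminals by their parenthesized right sides until none are
    # left. This terminates because no rule refers back to itself.
    pattern = RULES[rule_name]
    while PLACEHOLDER.search(pattern):
        pattern = PLACEHOLDER.sub(
            lambda m: "(?:" + RULES[m.group(1)] + ")", pattern
        )
    return pattern

TAG = re.compile(expand("Start"))

for text in ["<a>", "<br/>", '<img src="a>b.png" alt=\'x\' />', "</a>", "<a b=c>"]:
    print(repr(text), bool(TAG.fullmatch(text)))
```

The loop only works because the rule set is non-recursive; with a recursive rule, such as one describing nested element content, the expansion would never finish, which is the usual reason a plain regex cannot match whole documents.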
Of course there are other regexps that match a start/empty tag, but this is one of the simplest I have been able to develop that copes with the scenarios pointed out in the comments.
A simpler one could be:
<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>
if you are not dealing with characters outside the range \u0000-\u007f (the ASCII range) and you know the HTML file is valid. (This last one may be wrong, so use it with care: I constructed it in my head and it may mishandle some odd cases.)
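To illustrate how the simpler expression behaves, here is a small sketch using Python's `re` module; the variable names and the sample HTML string are made up for the example.

```python
import re

# The simpler img-only expression from above, unchanged.
IMG_TAG = re.compile(r"""<[iI][mM][gG][ \t\n\r]+([^>"']|"[^"]*"|'[^']*')*>""")

# Sample input (hypothetical): note the '>' characters inside quoted values.
html = '<p><img src="a>b.png" alt=\'it is > here\'> and <IMG src=x.png></p>'

tags = [m.group(0) for m in IMG_TAG.finditer(html)]
for tag in tags:
    print(tag)
```

Because quoted attribute values are consumed as a unit, a `>` inside quotes does not end the match. Note that the expression requires at least one whitespace character after the tag name, so a bare `<img>` with no attributes is not matched.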