First, to answer the direct question: your regex has a bug, since it will exclude a tag with a slash anywhere in it, not just at the end. For example, it would exclude this valid opening tag, because it has a slash in an attribute value:
<a href="foo/bar.html">
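As a hedged sketch of the fix in Python (your original pattern isn't shown, so this is just one way to do it): treat a slash inside quoted attribute values as ordinary text, and only reject a slash in an unquoted position before the closing >:

    import re

    # Hypothetical sketch: a slash is fine inside quoted attribute values;
    # only an unquoted slash (as in a self-closing tag) is rejected.
    open_tag = re.compile(r"""<\w+(?:"[^"]*"|'[^']*'|[^>/'"])*>""")

    print(bool(open_tag.search('<a href="foo/bar.html">')))  # True  - slash is inside quotes
    print(bool(open_tag.search('<br/>')))                    # False - self-closing tag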
We can fix that, but more seriously, this regex will lead to false positives, because it will also match inside comments and CDATA sections, where the same characters don't represent a valid tag. For example:
<!-- <foo> -->
or
<![CDATA[ <foo> ]]>
HTML strings embedded in scripts are especially likely to trigger false positives, and so is the regular use of < and > as comparison operators in JavaScript. And of course any sections of HTML which are commented out with <!-- -->.
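To make the false-positive problem concrete, here is a sketch with a hypothetical naive pattern:

    import re

    naive = re.compile(r"<\w+[^>]*>")   # hypothetical naive opening-tag pattern

    print(naive.findall("<!-- <foo> --> <p>real</p>"))
    # ['<foo>', '<p>'] -- the match inside the comment is a false positive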
To only match actual tags, you therefore also need to skip past comments and CDATA sections: the regex has to match comments and CDATA as well, but only capture the opening tags. This is still possible with a regex, but it becomes significantly more complex, for example (written in free-spacing mode):
    (
      <!-- .*? -->                                          # comment
    | <!\[CDATA\[ .*? \]\]>                                 # CDATA section
    | < \w+ ( "[^"]*" | '[^']*' | [^>/'"] )* />             # self-closing tag
    | (?<tag> < \w+ ( "[^"]*" | '[^']*' | [^>/'"] )* > )    # opening tag - captured
    | </ \w+ \s* >                                          # end tag
    )
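If you do go down the regex route, here is a hedged sketch of how such a pattern could be used from Python. Note that Python's named-group syntax is (?P<tag>...) rather than (?<tag>...), and that re.VERBOSE enables the free-spacing layout with comments:

    import re

    # Sketch only: matches comments, CDATA sections and tags, but captures
    # only opening tags in the named group "tag".
    TAG_RE = re.compile(r"""
          <!-- .*? -->                                           # comment
        | <!\[CDATA\[ .*? \]\]>                                  # CDATA section
        | < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* />            # self-closing tag
        | (?P<tag> < \w+ (?: "[^"]*" | '[^']*' | [^>/'"] )* > )  # opening tag - captured
        | </ \w+ \s* >                                           # end tag
    """, re.VERBOSE | re.DOTALL)

    html = '<!-- <foo> --><a href="foo/bar.html">link<br/></a>'
    for match in TAG_RE.finditer(html):
        if match.group("tag"):          # None unless the opening-tag branch matched
            print(match.group("tag"))   # prints: <a href="foo/bar.html">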
And this is just for XHTML conforming to the HTML compatibility guidelines. If you want to handle arbitrary XHTML, you should also handle processing instructions and internal DTDs, since they can also embed false positives. If you also want to handle HTML, there are additional complexities like the <script> tag. And if you want to handle invalid HTML on top of that, it gets yet more complex.
Given the complexity, I would not recommend going down that road. Instead, look for an off-the-shelf (X)HTML parsing library which can solve your problem.
A parser typically uses regular expressions (or similar) under the hood to split the document into "tokens" (doctype, start tags, end tags, text content etc.). But someone else will have debugged and tested these regexes for you! Depending on the type of parser it may further build a tree structure of elements by matching start tags to end tags. This will almost certainly save you a lot of time.
The exact parser library to use depends on your language, your platform, and the task you are solving. If you need access to the actual tag substrings (e.g. if you are writing a syntax highlighter for HTML), you need a SAX-style parser which exposes the syntax tokens directly.
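For instance, Python's built-in html.parser works in this event-driven style and can hand you the raw start-tag text. A minimal sketch:

    from html.parser import HTMLParser

    class StartTagCollector(HTMLParser):
        """Collects the raw source text of opening tags, skipping comments."""
        def __init__(self):
            super().__init__()
            self.start_tags = []

        def handle_starttag(self, tag, attrs):
            # get_starttag_text() returns the raw text of the tag just parsed
            self.start_tags.append(self.get_starttag_text())

        def handle_startendtag(self, tag, attrs):
            pass  # ignore self-closing tags such as <br/>

    collector = StartTagCollector()
    collector.feed('<!-- <foo> --><a href="foo/bar.html">link</a><br/>')
    print(collector.start_tags)   # ['<a href="foo/bar.html">']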
If you are only performing the tag matching in order to manually build a syntax tree of elements, then a DOM parser does this work for you. But a DOM parser does not expose the underlying tag syntax, so it does not solve the exact problem you describe.
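As an illustration with Python's xml.etree.ElementTree (assuming well-formed XHTML input): you get back elements and attribute dictionaries, not the original tag substrings:

    import xml.etree.ElementTree as ET

    root = ET.fromstring('<div><a href="foo/bar.html">link</a><br/></div>')
    for element in root.iter():
        # You get tag names and attribute dicts, not the raw '<a href="...">' text
        print(element.tag, element.attrib)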
You should also consider whether you need to parse invalid HTML. That is a much more complex task, but out on the wild web most HTML is actually invalid. Something like Python's html5lib can parse invalid HTML.
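A minimal sketch with html5lib (a third-party package; by default it returns an ElementTree-style tree, with tag names in the XHTML namespace):

    import html5lib  # third-party: pip install html5lib

    # Invalid HTML: unclosed <p> tags and a stray </b> -- html5lib applies
    # the same error-recovery rules that browsers use.
    tree = html5lib.parse("<p>one<p>two</b>")
    for element in tree.iter():
        print(element.tag)   # e.g. '{http://www.w3.org/1999/xhtml}p'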