David, the reason (<span [^>]*>)> does not match is that you have a small typo. You see, that expression tries to match two closing >: look closely at the end, >)>. For instance, it would match <span hey there>> but not <span hey there>.
To match the opening span, make sure you only have one >.
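To see the typo in action, here is a minimal Python sketch. Python's re module is only an assumption for illustration (the question does not say which regex engine is in use), and the name typo_pattern is made up for this example.

```python
import re

# The pattern from the question, including the stray trailing ">".
typo_pattern = re.compile(r"(<span [^>]*>)>")

for text in ("<span hey there>>", "<span hey there>"):
    match = typo_pattern.search(text)
    # Only the input ending in ">>" can satisfy the second ">".
    print(repr(text), "->", match.group(0) if match else "no match")
```

The first input matches in full; the second produces no match, because nothing is left for the final > to consume.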
With all the disclaimers about using regex to match html, this regex will do:
<span[^>]*>
If you sometimes expect SPAN, make sure to make it case-insensitive.
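As a rough illustration of the case-insensitive version, here is a short sketch, again assuming Python's re module. The sample string and the name span_open are invented for the example; other engines spell the flag differently (often a trailing i or an inline (?i)).

```python
import re

# Same regex as above, compiled with the case-insensitive flag.
span_open = re.compile(r"<span[^>]*>", re.IGNORECASE)

html = '<SPAN class="a">one</SPAN> <span>two</span>'

# Closing tags start with "</", so only the opening tags are returned.
print(span_open.findall(html))  # ['<SPAN class="a">', '<span>']
```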
Only if you have time: an additional flourish
In a comment, @DavidEhrmann points out that the regex above will match <spanner>. If you want to make him happy and ensure that any tag longer than a bare <span> always has a space right after span, you can use:
<span(?: [^>]*)?>
However, in my view, that is an unnecessary flourish. When we parse html with regex, we always know that we are using a crude tool, and we rely on the input to be fairly well-formed. Even with the revised regex above, there are still a million ways to match improper html, such as: <span classification>
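A quick side-by-side of the two patterns, sketched in Python (an assumption, as the thread names no language). The names simple and strict are just labels for this example, and fullmatch is used so that each sample tag is tested as a whole.

```python
import re

simple = re.compile(r"<span[^>]*>")        # the basic regex
strict = re.compile(r"<span(?: [^>]*)?>")  # the "flourish" version

samples = ("<span>", '<span class="x">', "<spanner>", "<span classification>")

for tag in samples:
    # fullmatch requires the pattern to cover the entire string.
    print(f"{tag!r}: simple={bool(simple.fullmatch(tag))}, "
          f"strict={bool(strict.fullmatch(tag))}")
```

The stricter pattern rejects <spanner>, but both patterns still accept <span classification>, which is exactly the kind of input a regex cannot judge.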
What to do? Nothing. Know your tools, know what they can do, know the risks, and decide when the situation warrants regex and when it warrants a DOM parser.
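For comparison, here is a hedged sketch of the parser route using Python's standard-library html.parser. Note the assumptions: html.parser is an event-based parser rather than a true DOM parser, so it only stands in for one here, and the class name SpanCollector and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Records the attributes of every opening <span> tag it sees."""

    def __init__(self):
        super().__init__()
        self.spans = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <SPAN> arrives as "span",
        # and <spanner> arrives as "spanner" and is ignored.
        if tag == "span":
            self.spans.append(dict(attrs))


collector = SpanCollector()
collector.feed('<p><SPAN class="a">one</SPAN> <spanner>x</spanner> '
               '<span id="b">two</span></p>')
collector.close()
print(collector.spans)  # [{'class': 'a'}, {'id': 'b'}]
```

The parser handles the case and the <spanner> question on its own, at the cost of more code than a one-line regex.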