In general, it's not recommended to use regex to parse HTML, but if you have to,
something like the regex below will work for your problem.
In this regex, 'body' is OR'd with 'span' as an example. Note that comments are matched and passed through unchanged, because they can contain HTML that should not be touched. Script blocks are matched and passed through for the same reason.
I would leave the comment sub-expression in. Be aware that scripts can alter how the document renders and can use language constructs that hide HTML you might want to process. Handling that properly is, of course, not a job for regex.
If you want, you can remove the 'script' sub-expression in the hope of also modifying string constants in scripts that contain what you want to alter. This is not recommended, though.
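To make the risk concrete, here is a minimal Python sketch (Python is used only for illustration; it is not part of the regex in this answer). The pattern `NAIVE` and the sample string are hypothetical. The pattern has no comment or script sub-expression, so it also rewrites markup inside a comment and inside a script string constant.

```python
import re

# Hypothetical naive pattern: only the body/span branch, with no
# comment or script sub-expression to shield hidden markup.
NAIVE = re.compile(r"(<(?:body|span))\s+[^>]*?(/?>)")

sample = (
    '<!-- <span class="old"> -->'
    '<script>var t = "<span class=x>";</script>'
    '<span class="new">text</span>'
)

# Keep only the tag name and the closing bracket; attributes are dropped.
result = NAIVE.sub(r"\1\2", sample)
print(result)
# <!-- <span> --><script>var t = "<span>";</script><span>text</span>
```

Both the commented-out tag and the string constant inside the script were changed, which is usually not what you want.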
Raw regex (modifiers: expanded, 'dot includes newlines')
In C#, the capture groups could be named so that each OR'd sub-expression reuses the same names. Example: (?<begin> ..) .. (?<end> ..) | (?<begin> ..) .. (?<end> ..)
The replacement is then just ["begin"] + ["end"]. Reusing names like this was buggy for me in Perl 5.10, so I just use the capture buffer numbers. .NET might handle it correctly.
Search
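   # (1,2) comment, passed through unchanged
   ( <!--.*?--> ) ()
 |
   # (3,4) script block or stray script tag, passed through unchanged
   (
      (?:
         <script
         (?>
            (?:\s+(?:".*?"|'.*?'|[^>]*?)+)?
            \s*
            >
         )(?<!/> )
         .*?
         </script\s*>
       |
         </?script (?:\s+(?:".*?"|'.*?'|[^>]*?)+)? \s*/?>
      )
   ) ()
 |
   # (5,6) body or span tag with attributes; only the two groups are kept
   ( <(?:body|span) ) (?!\s*/?>)
   \s+ (?:".*?"|'.*?'|[^>]*?)+
   ( /?> )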
Replace
$1$2$3$4$5$6
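As a rough illustration of how the search and replace fit together, here is a Python sketch under stated assumptions. `re.VERBOSE` and `re.DOTALL` stand in for the 'expanded' and 'dot includes newlines' modifiers. The script branch is simplified: it has no atomic group or lookbehind, and it does not handle stray script tags. The function name `strip_attributes` and the sample document are made up for the example. In Python an unmatched group is `None`, so the groups are joined with empty strings in place of `$1$2$3$4$5$6`.

```python
import re

# Simplified port of the search pattern; group numbers match the original.
PATTERN = re.compile(
    r"""
      # (1,2) comment, passed through unchanged
      ( <!--.*?--> ) ()
    |
      # (3,4) script block (simplified), passed through unchanged
      ( <script\b [^>]*> .*? </script\s*> ) ()
    |
      # (5,6) body or span tag with attributes; attributes are not captured
      ( <(?:body|span) ) (?!\s*/?>)
      \s+ (?:".*?"|'.*?'|[^>]*?)+
      ( /?> )
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_attributes(html: str) -> str:
    """Concatenate all six groups; groups from unmatched branches are empty."""
    return PATTERN.sub(lambda m: "".join(g or "" for g in m.groups()), html)


html = (
    '<!-- <span class="x"> -->'
    '<body class="main" onload="init()">'
    "<span id='a'>hi</span>"
    '<script>var s = "<span class=y>";</script>'
    "</body>"
)

print(strip_attributes(html))
# <!-- <span class="x"> --><body><span>hi</span><script>var s = "<span class=y>";</script></body>
```

Only one branch matches at a time, so concatenating every group reproduces comments and scripts verbatim and reduces a body or span tag to its name plus the closing bracket.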