Your problem is here: <(\w+).+?(?=>)>
This says:
- open an angle bracket
- consume as many word characters as possible (min 1)
- consume as few characters as possible (min 1)
- make sure a closing angle bracket follows
- consume the closing angle bracket
First of all, step 4 (the lookahead) is superfluous: if no closing bracket follows, step 5 fails to match anyway.
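A quick way to convince yourself that the lookahead adds nothing is to run the pattern with and without it. This is only a sketch using Python's re module (the language, the helper name spans and the sample strings are my assumptions, not something from the question):

```python
import re

with_lookahead = re.compile(r"<(\w+).+?(?=>)>")
without_lookahead = re.compile(r"<(\w+).+?>")

# Sample inputs are made up for illustration.
samples = ["<h5>Thing</h5>", '<a href="x">link</a>', "<br/>", "no tags here"]

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]

for s in samples:
    # Prints True when both patterns match exactly the same spans.
    print(s, spans(with_lookahead, s) == spans(without_lookahead, s))
```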
But the bigger problem is step 3. Let's see what happens on <h5>Thing</h5>:
- <
- h5 (because > is not a word character any more)
- >Thing</h5, because this is the least amount matched before a closing angle bracket (remember, matching 0 characters here is not an option)
- Make sure next is >
- >
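The trace above can be reproduced directly. A small sketch, assuming Python's re module (my choice, the question does not say which engine is used), showing that the match swallows the whole element rather than just the opening tag:

```python
import re

pattern = re.compile(r"<(\w+).+?(?=>)>")

m = pattern.search("<h5>Thing</h5>")
print(m.group(0))  # <h5>Thing</h5>  (the entire string, not just <h5>)
print(m.group(1))  # h5

# With an attribute there is something for .+? to consume before the first >,
# so the opening tag matches on its own. The closing tag is never matched,
# because / is not a word character.
print(pattern.findall('<h5 class="a">Thing</h5>'))  # ['h5']
```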
Anyway, in the simple case, what you want can be done by /<\/?.+?>/. This will break if attributes have values that include a greater-than sign: <div title="a>b">. Avoiding this is possible, but it makes the regexp a bit more complex, kind of like this (but I may have forgotten something):
<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>
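To see the difference between the two patterns, here is a sketch in Python (my choice of language; the names simple, robust, plain and tricky are made up for the example) that runs both on a tag whose attribute value contains >:

```python
import re

simple = re.compile(r"<\/?.+?>")
# Triple-quoted raw string so the pattern can contain both ' and " unescaped.
robust = re.compile(
    r"""<\w+(?:\s+\w+(?:=(?:"[^"]*"|'[^']*'|[^'"][^\s>]*)?)?)*\s*>|<\/\w+>"""
)

plain = "<h5>Thing</h5>"
tricky = '<div title="a>b">text</div>'

print(simple.findall(plain))   # ['<h5>', '</h5>']
print(robust.findall(plain))   # ['<h5>', '</h5>']
print(simple.findall(tricky))  # ['<div title="a>', '</div>']
print(robust.findall(tricky))  # ['<div title="a>b">', '</div>']
```

The longer pattern is still not complete. For instance, it does not match a self-closing <br/>, so treat it as a starting point rather than a full tag matcher.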