You should never use regular expressions to match nested patterns like brackets or, as in your example, XML tags.
The problem is that regular expressions in almost all languages cannot match nested structures. The few exceptions are engines with recursion or balancing extensions, such as PCRE, Perl, and .NET.
Examples:
Regex: /<span class="awesome">.*<\/span>/
Input: <span><span class="awesome">text</span></span>
Matches: <span class="awesome">text</span></span> (too much: the greedy .* runs to the last </span>)
Regex: /<span class="awesome">.*?<\/span>/
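Input: <span class="awesome"><span>text</span></span>
Matches: <span class="awesome"><span>text</span> (too little: the lazy .*? stops at the first </span>)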
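The two cases above can be reproduced in a few lines of JavaScript. This is only a small sketch; the variable names and the two input strings are made up for illustration.

```javascript
// Greedy: .* runs to the LAST </span>, so the match also swallows
// the closing tag of the outer element.
const greedy = /<span class="awesome">.*<\/span>/;
const awesomeInside = '<span><span class="awesome">text</span></span>';
const greedyMatch = awesomeInside.match(greedy)[0];
console.log(greedyMatch); // <span class="awesome">text</span></span>

// Lazy: .*? stops at the FIRST </span>, so the match ends at the
// closing tag of the inner element and the outer one is left unclosed.
const lazy = /<span class="awesome">.*?<\/span>/;
const awesomeOutside = '<span class="awesome"><span>text</span></span>';
const lazyMatch = awesomeOutside.match(lazy)[0];
console.log(lazyMatch); // <span class="awesome"><span>text</span>
```

Neither quantifier can know which closing tag belongs to which opening tag, because that requires counting the nesting depth.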
Solutions:
There are several ways to handle XML/HTML data correctly. The first is to use an XML library. Almost all high-level languages have good XML libraries, so just use one.
For debugging and quick-and-dirty solutions, you can also split the content and analyze it yourself.
Because my R skills are very bad, here is a simple JS solution:
// simple XML string
const text = '<br/><br/><span class="header3">Statistical Analysis 1 for Percent Change From Baseline in the <span class="hit_inf">Psoriasis</span> Area Severity Index (PASI) Score at Week 16</span>';

// split the string; the capture groups keep the delimiters
// ("<", "</", ">", "/>") and the tag contents as separate parts
const parts = text.split(/(<\/?)(.*?)(\/?>)/);

// predefine some state
let level = 0;            // current nesting depth
const path = ["!ROOT!"];  // stack of the currently open tags
let isOpen = false;       // was the most recent tag an opening tag?
let isTag = false;        // are we between "<" or "</" and ">" or "/>"?

// go through all parts
parts.forEach((match, index, all) => {
  // ignore empty stuff
  if (!match) return;

  // check type of match
  switch (match) {
    case "<":
      level++;
      isOpen = true;
      isTag = true;
      break;
    case "</":
      level--;
      isOpen = false;
      isTag = true;
      break;
    case ">":
      // add to path if open, otherwise remove last one from path
      if (isOpen) {
        path.push(all[index - 1]);
      } else {
        path.pop();
      }
      isTag = false;
      break;
    case "/>":
      // self-closing tag: undo the level++ from its "<"
      level--;
      isTag = false;
      break;
    default:
      // just print it out, indented by nesting level
      if (isTag) {
        console.log(new Array(level + (isOpen ? 0 : 1)).join(" "), "[TAG]", isOpen ? "[OPEN]" : "[CLOSE]", match);
      } else {
        console.log(new Array(level + 1).join(" "), "[TEXT]", match);
      }
  }
});
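
The same depth-counting idea can also be used to pull out one complete element, nested children included. The following is only a rough sketch: `extractElement` and the `sample` string are hypothetical names invented here, and the code assumes that `tagName` is a plain tag name and that no attribute value contains a `>`. For real data an XML library is still the better choice.

```javascript
// Hypothetical helper: returns the text of the first <tagName ...> element,
// including nested elements of the same name, or null if none is balanced.
function extractElement(source, tagName) {
  // Matches a single opening, closing or self-closing tag of that name.
  // The regex only finds individual tags; nesting is handled by the counter.
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*?(/?)>`, "g");
  let depth = 0;
  let start = -1;
  let m;
  while ((m = tagPattern.exec(source)) !== null) {
    const isClosing = m[1] === "/";
    const isSelfClosing = m[2] === "/";
    if (isSelfClosing) {
      // A self-closing tag at top level is a complete element on its own.
      if (depth === 0) return m[0];
      continue;
    }
    if (!isClosing) {
      if (depth === 0) start = m.index; // remember where the outer element begins
      depth++;
    } else if (depth > 0) {
      depth--;
      // Back at depth 0: this closing tag belongs to the outer element.
      if (depth === 0) return source.slice(start, m.index + m[0].length);
    }
  }
  return null;
}

const sample = '<br/><span class="a">one <span class="b">two</span> three</span> tail';
console.log(extractElement(sample, "span"));
// <span class="a">one <span class="b">two</span> three</span>
```

The counter does the part a plain regular expression cannot: it pairs each closing tag with the opening tag at the same depth.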