How can I match only the inner span tag (or only the outer one, if that's easier)?

<br><br><span class="header3">Statistical Analysis 1 for Percent Change From Baseline in the <span class="hit_inf">Psoriasis</span> Area Severity Index (PASI) Score at Week 16</span>

I'm trying: (?:<span).*?(<span).*?(?:</span>).*</span> but only the second <span is captured in a separate group. I need to match on the span keyword alone, not on the class names. Any suggestions?

Peter.k
1 Answer

You should never use regular expressions to match nested patterns such as brackets or, as in your example, XML tags. The problem is that in almost all languages (with two or three exceptions) regular expressions cannot match nested structures, because they have no way to count how many tags are currently open.

Examples:

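Regex: /<span class="awesome">.*<\/span>/

Input: <span><span class="awesome">text</span></span>

Matches: <span class="awesome">text</span></span> (greedy: it runs past the inner closing tag to the last </span>)


Regex: /<span class="awesome">.*?<\/span>/

Input: <span class="awesome"><span>text</span></span>

Matches: <span class="awesome"><span>text</span> (lazy: it stops at the first </span>, which closes the inner tag)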
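
The two examples above can be checked directly. This is only a small sketch; the variable names (greedy, lazy, innerAwesome, outerAwesome) are made up for illustration and are not part of the question.

```javascript
// Greedy vs. lazy quantifiers on nested <span> tags.
const greedy = /<span class="awesome">.*<\/span>/;
const lazy = /<span class="awesome">.*?<\/span>/;

// The "awesome" span is the inner one here ...
const innerAwesome = '<span><span class="awesome">text</span></span>';
// ... and the outer one here.
const outerAwesome = '<span class="awesome"><span>text</span></span>';

const greedyMatch = innerAwesome.match(greedy)[0];
const lazyMatch = outerAwesome.match(lazy)[0];

// Greedy overshoots and swallows the outer closing tag:
console.log(greedyMatch); // <span class="awesome">text</span></span>

// Lazy stops too early, at the inner closing tag:
console.log(lazyMatch);   // <span class="awesome"><span>text</span>
```

Neither pattern returns a balanced element, because neither can pair each opening tag with its own closing tag.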

Solutions:

There are a few ways to handle XML/HTML data correctly. The first is to use an XML library. Almost all high-level languages have good XML libraries, so just use one. For debugging and quick-and-dirty solutions, you can also split the content into tags and text and analyze it yourself.

Because my R skills are very bad, here is a simple JS solution:

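// sample markup from the question (<br> written as self-closing <br/>)
var text = '<br/><br/><span class="header3">Statistical Analysis 1 for Percent Change From Baseline in the <span class="hit_inf">Psoriasis</span> Area Severity Index (PASI) Score at Week 16</span>';

// split the string; the capture groups keep "<" or "</", the tag content,
// and ">" or "/>" as separate parts of the result
var parts = text.split(/(<\/?)(.*?)(\/?>)/);

// state for the walk over the parts
var level = 0;           // current nesting depth
var path = ["!ROOT!"];   // contents of the currently open tags
var isOpen = false;      // true if the current tag is an opening tag
var isTag = false;       // true while between "<" (or "</") and ">"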

// go through all parts
parts.forEach((match, index, all) => {
    // ignore empty stuff
    if (!match) return;

    // check type of match
    switch (match) {
        case "<":
            level++;
            isOpen = true;
            isTag = true;
            break;
        case "</":
            level--;
            isOpen = false;
            isTag = true;
            break;
        case ">":
            // add to path if open, otherwise remove last one from path
            if (isOpen) {
                path.push(all[index - 1]);
            } else {
                path.pop();
            }
            isTag = false;
            break;
        case "/>":
            // self-closing tag: undo the level++ from "<", nothing is added to path
            level--;
            isTag = false;
            break;
        default:
            // just print it out
            if (isTag) {
                console.log(new Array(level + (isOpen ? 0 : 1)).join("  "), "[TAG]", isOpen ? "[OPEN]" : "[CLOSE]", match);    
            } else {
                console.log(new Array(level + 1).join("  "), "[TEXT]", match);    
            }
    }
});
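
To actually pull out the inner span's text rather than only print the structure, the same split can drive a small stack. This is a rough sketch, not a real parser: the function name deepestSpanText and the variable sample are made up for illustration, and it assumes well-formed markup with no ">" inside attribute values and no newlines inside tags.

```javascript
// Hypothetical helper: return the text directly inside the most deeply
// nested <span>, or null if there is no span with text.
function deepestSpanText(html) {
    // With three capture groups, split() yields chunks of four:
    // [text, "<" or "</", tag content, ">" or "/>", text, ...]
    const parts = html.split(/(<\/?)(.*?)(\/?>)/);
    const open = []; // names of the currently open tags
    let best = { depth: 0, text: null };

    for (let i = 0; i < parts.length; i += 4) {
        const text = parts[i];
        const spanDepth = open.filter((name) => name === "span").length;
        // Keep the text if it sits directly inside a span that is nested
        // deeper than anything seen so far.
        if (text && open[open.length - 1] === "span" && spanDepth > best.depth) {
            best = { depth: spanDepth, text: text };
        }
        // The final chunk is trailing text with no tag after it.
        if (i + 3 >= parts.length) break;

        const opener = parts[i + 1];
        const body = parts[i + 2];
        const closer = parts[i + 3];
        if (opener === "</") {
            open.pop(); // closing tag
        } else if (closer === ">") {
            open.push(body.trim().split(/\s+/)[0]); // opening tag: keep only the name
        }
        // Self-closing tags ("/>") neither open nor close anything.
    }
    return best.text;
}

const sample = '<br/><br/><span class="header3">Statistical Analysis 1 for Percent Change From Baseline in the <span class="hit_inf">Psoriasis</span> Area Severity Index (PASI) Score at Week 16</span>';
console.log(deepestSpanText(sample)); // Psoriasis
```

For anything beyond a quick check, an XML/HTML library remains the safer choice, since this sketch ignores comments, CDATA, and malformed input.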
kpalatzky