The pattern and the text are available online at https://regex101.com/r/aL5dD4/2. The pattern should find the node values of span elements that are located between code tags.

Text is as follows:

<code>
    <div>
        <span ds = 'dsds'>12 3 ->;:4</span><span>abc</span>
    </div>
</code>

Regex pattern is as follows:

/(?<=<code>).*?<span[^>]*?>(.*?)(?=<\/span>.*?<\/code)/gs

I need it to match both node values, 12 3 ->;:4 and abc, but only the first is found.

How can I get both? Thank you.

trzczy

3 Answers

Regex is never a good tool for parsing HTML/XML. Use the DOM instead, as below:

$html=<<<EOF
<code>
    <div>
        <span ds = 'dsds'>12 3 ->;:4</span><span>abc</span>
    </div>
</code>
EOF;

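// loadHTML() is an instance method; calling it statically throws an Error in PHP 8.
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses any parser warnings, as in the original call
$xpath = new DOMXPath($doc);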
$nodeList = $xpath->query('//code/div/span');

$vals = array();
for($i=0; $i < $nodeList->length; $i++) {
    $vals[] = $nodeList->item($i)->nodeValue;
}

print_r( $vals );

Output:

Array
(
    [0] => 12 3 ->;:4
    [1] => abc
)
anubhava

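Though I agree with the sentiment against using regex for HTML, to answer your question: eliminating the lookbehind (?<=<code>) and the .*? that follows it allows the regex to find the second occurrence as well. With the lookbehind in place, every match has to start immediately after <code>, which can happen only once in this text. This leaves the following regex: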

<span[^>]*?>(.*?)(?=<\/span>.*?<\/code)

NOTE: This returns two separate matches and does not require the string to be inside a code element; the lookahead only requires a closing code tag somewhere after the span. To require that a match be in a code block, you could use @HamZa's commented solution (though that solution provides one match with the two strings as groups), which may even be closer to what you are looking for.
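
As a rough illustration, the pattern above could be applied in PHP as sketched below. This assumes PHP is the target language (the question does not say) and reuses the sample text from the question; the variable names are made up for this example. PHP has no g flag, since preg_match_all already collects every match.

```php
<?php
// Sample input copied from the question.
$html = <<<EOF
<code>
    <div>
        <span ds = 'dsds'>12 3 ->;:4</span><span>abc</span>
    </div>
</code>
EOF;

// The pattern from this answer. The "s" flag lets "." match newlines,
// which the lookahead needs in order to reach </code on a later line.
$pattern = '/<span[^>]*?>(.*?)(?=<\/span>.*?<\/code)/s';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds capture group 1 (the span contents) for each match.
$values = $matches[1];
print_r($values);
```

Each span produces its own match here, so $values should contain one entry per span.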

rtmh

One way to do this is to first grab the code block with something like /<code[^>]*?>(.*?)<\/code>/gs and then run /<span[^>]*?>(.*?)<\/span>/gs on each of these matches.

These 'simpler' regexes are also easier to debug, should you run into problems. This approach also extracts all spans from multiple code blocks sequentially.
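
A minimal sketch of this two-step approach, assuming PHP and its preg_match_all function (the question does not name a language). The second code block in the sample text is a hypothetical addition, included only to show that spans from several blocks are collected in order; the variable names are also made up for this example.

```php
<?php
// First block is the sample from the question; the second block is a
// hypothetical addition for illustration only.
$html = <<<EOF
<code>
    <div>
        <span ds = 'dsds'>12 3 ->;:4</span><span>abc</span>
    </div>
</code>
<code><span>second block</span></code>
EOF;

$spanValues = [];

// Step 1: capture the inside of every <code>...</code> block.
preg_match_all('/<code[^>]*?>(.*?)<\/code>/s', $html, $codeBlocks);

foreach ($codeBlocks[1] as $block) {
    // Step 2: capture the contents of every span inside this block.
    preg_match_all('/<span[^>]*?>(.*?)<\/span>/s', $block, $spans);
    $spanValues = array_merge($spanValues, $spans[1]);
}

print_r($spanValues);
```

Because each step uses a short pattern, the intermediate $codeBlocks result can be inspected on its own when a match goes missing.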

fabianegli