RegEx for excluding special patterns

Question

I'm trying to write a regular expression to parse an HTML string.

I need to find a single word wrapped in a <b> tag that is not followed by certain other tags, <br> for example. The following regexp seems to work fine until there is whitespace between the tags.

preg_match('/\<b[^<]*?\>([^\s<]+?)\<\/b\>\s*(?!\<br\>)/ui', '<b>word</b> <br>');

Expected behaviour when there are no spaces:
https://regex101.com/r/mKTmM3/11

Unexpected behaviour with a space between </b> and <br>:
https://regex101.com/r/mKTmM3/10

How do I solve this problem?

[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) — Toto, May 22 '19 at 16:30

Emma · Answer 1 · 2019-05-22T16:07:27.777

Here, we might be able to solve this problem.

Let's start with a "not followed by" strategy (a negative lookahead) to exclude the undesired <br> and see if that works. For that, we anchor the expression at the end of the line with $, and we may prefer not to anchor it at the start:

((<b>([a-z]+)<\/b>)((?!<br>).)*)$

We have also added extra capturing groups (), which can be removed if they are not needed.
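
As a rough illustration of removing those extra groups, the sketch below rewrites the pattern above with non-capturing groups (?: ), so that only the word itself is captured. The variable names and the sample input are assumptions made for this example.

```javascript
// Same idea as the pattern above, but only the word itself is captured.
// (?: ... ) groups without capturing, so m[1] is the only capture group.
const slim = /<b>([a-z]+)<\/b>(?:(?!<br>).)*$/gim;

const sample = [
  "<b>word</b><br>",
  "<b>word</b>   <br>",
  "<b>word</b> in text",
  "half<b>word</b> ",
].join("\n");

// Collect group 1 from every match; the first two lines are rejected
// because a <br> appears between </b> and the end of the line.
const words = [...sample.matchAll(slim)].map((m) => m[1]);
console.log(words);
```

Each match now carries a single group, so the result is easier to consume than the five-element arrays of the fully grouped pattern.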

Test

$re = '/((<b>([a-z]+)<\/b>)((?!<br>).)*)$/im';
$str = '<b>word</b><br>
<b>word</b>   <br>
<b>word</b> in text
half<b>word</b> ';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output

array(2) {
  [0]=>
  array(5) {
    [0]=>
    string(19) "<b>word</b> in text"
    [1]=>
    string(19) "<b>word</b> in text"
    [2]=>
    string(11) "<b>word</b>"
    [3]=>
    string(4) "word"
    [4]=>
    string(1) "t"
  }
  [1]=>
  array(5) {
    [0]=>
    string(12) "<b>word</b> "
    [1]=>
    string(12) "<b>word</b> "
    [2]=>
    string(11) "<b>word</b>"
    [3]=>
    string(4) "word"
    [4]=>
    string(1) " "
  }
}

Demo

const regex = /((<b>([a-z]+)<\/b>)((?!<br>).)*)$/igm;
const str = `<b>word</b><br>
<b>word</b>   <br>
<b>word</b> in text
half<b>word</b> `;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
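
Returning to the pattern in the question, the whitespace case appears to fail because \s* sits outside the lookahead: when (?!<br>) fails after consuming the space, the engine backtracks to zero spaces and tests the lookahead directly after </b>, where it sees a space rather than <br> and succeeds. One possible fix, sketched here in JavaScript to match the demo above, is to move \s* inside the lookahead. The names `original` and `fixed` and the sample inputs are assumptions for illustration; the same lookahead syntax is also accepted by PHP's preg_match.

```javascript
// Pattern from the question: \s* is outside the lookahead, so the engine
// can backtrack to "zero spaces" and test (?!<br>) right after </b>.
const original = /<b[^<]*?>([^\s<]+?)<\/b>\s*(?!<br>)/iu;

// Sketch of a fix: the optional whitespace is checked inside the lookahead,
// so there is no quantifier left outside it to backtrack into.
const fixed = /<b[^<]*?>([^\s<]+?)<\/b>(?!\s*<br>)/iu;

const inputs = ["<b>word</b><br>", "<b>word</b> <br>", "<b>word</b> in text"];

// Print, for each input, whether the original and the fixed pattern match.
for (const input of inputs) {
  console.log(JSON.stringify(input), original.test(input), fixed.test(input));
}
```

In PCRE, a possessive quantifier (\s*+) placed before the lookahead should have a similar effect, since it forbids giving the whitespace back; JavaScript has no possessive quantifiers, which is why the sketch moves \s* into the lookahead instead.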

Thank you @Emma, this is much better than what I started with! Here is the thing though: is there a chance to do it without using the end char? `br` is only one possible example; there might be paragraphs and divs as well, and some text in between. I've provided more use cases that are still not covered. Also, I added a negative lookbehind to match a `b` tag followed by some text and then by `br`: [https://regex101.com/r/FeFqON/2](https://regex101.com/r/FeFqON/2) Is this solution acceptable? `(((\w+)<\/b>)((?!(?<![\w\d])(
|
|<\/p>)).)*)$` — YNWA, May 22 '19 at 16:53
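
For the follow-up about dropping the end anchor and excluding more than one tag, a minimal sketch might put an alternation inside the lookahead. The helper `buildPattern`, the tag list, and the sample HTML are hypothetical and chosen only for illustration. The sketch handles whitespace between </b> and the excluded tag, but not arbitrary text in between, and it matches tags literally, so a tag with attributes such as <div class="x"> would not be recognised.

```javascript
// Hypothetical helper: builds a pattern that matches <b>word</b> unless it is
// followed, after optional whitespace, by one of the given tags.
function buildPattern(blockedTags) {
  // Escape regex metacharacters so every tag is matched literally.
  const escaped = blockedTags.map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  return new RegExp(`<b>(\\w+)</b>(?!\\s*(?:${escaped.join("|")}))`, "gi");
}

// Assumed list of tags that must not follow the bold word.
const blocked = ["<br>", "<div>", "</p>"];

const html = [
  "<b>one</b><br>",
  "<b>two</b>   </p>",
  "<b>three</b> in text",
  "half<b>four</b> ",
  "<b>five</b>",
  "<div>",
].join("\n");

// \s also matches newlines, so "five" is rejected by the <div> on the next line.
const kept = [...html.matchAll(buildPattern(blocked))].map((m) => m[1]);
console.log(kept);
```

Because nothing is anchored to the end of the line, the lookahead alone decides whether a word is kept, and extending the exclusion list only means adding another entry to the array.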
