Regexp with negative lookahead and xhtml

Question

I have the following regular expression which performs a negative lookahead.

/\b(\w+)\b(?![^<]*</{0,1}(a|script|link|img)>)/gsmi

I want to match all words in the text, including those inside HTML elements, except anything belonging to a, script, link and img tags. The problem occurs when an img tag is used.

An img tag has no closing tag, so the expression does not exclude it.

<p>This is a sample text <a href="#">with</a> a link and an image <img src="" alt="" /> and so on</p>

The regular expression should not match the anchor (not even the text between its opening and closing tags), and it should not match the img.

I think I am almost there but I can't get it to work properly. This is what I've tried as well:

/\b(\w+)\b(?![^<]*</{0,1}(a|script|link)>)(?![^\<img]*>)/gsmi

Somehow the last one only works on the img tag when there is no "i", "m" or "g" inside the tag. As soon as you add something like height=, it stops working.

Edit: The goal is to extract all words from the text except those inside anchor and image tags. The text may also contain no HTML at all.
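
A possible explanation of this behaviour, sketched under the assumption that the pattern is run through PHP's `preg_match_all` (the `g` flag is dropped here because PHP's preg functions do not accept it): `[^\<img]` is a character class, so it means "any single character except `<`, `i`, `m` or `g`", not "anything except the string `<img`". The second pattern, `(?![^<>]*>)`, is a commonly used "not inside a tag" lookahead that is not part of the original question; it is shown only for comparison and would also misfire on text containing a literal `>`.

```php
<?php
// The img lookahead from the second attempt, isolated from the rest.
// [^\<img] excludes the single characters <, i, m and g.
$attempt = '/\b(\w+)\b(?![^\<img]*>)/i';

// Assumption, not from the question: a common "not inside a tag" lookahead.
$insideTag = '/\b(\w+)\b(?![^<>]*>)/i';

$plain  = '<img src="" alt="" />';
$height = '<img src="" alt="" height="10" />';

preg_match_all($attempt, $plain, $a);      // no i, m or g after any word
preg_match_all($attempt, $height, $b);     // "height" interrupts [^\<img]*
preg_match_all($insideTag, $height, $c);   // only < and > are excluded

print_r($a[1]); // empty: every word in the tag is skipped
print_r($b[1]); // img, src, alt: these words leak out of the tag
print_r($c[1]); // empty again
```

In the `height` case, the run of `[^\<img]` characters stops at the `i` of `height` before it can reach `>`, so the negative lookahead succeeds and the words before it are matched.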

Hint: Easier and recommended to use [`DOM`](http://php.net/manual/en/book.dom.php), and just a quick note you can't use whole words inside of character classes, i.e `[^\ — hwnd, Sep 17 '14 at 15:02
It might be easier yes, but for now there is a lot depending on that one regular expression. I will certainly have a look at the DOM solution. — ppr, Sep 17 '14 at 15:10
I was just saying that for a simple case, using a regex is not always bad. But I recommend `DOM` any day over regex for intermediate cases; with a regex you could eventually end up with false matches. — hwnd, Sep 17 '14 at 15:13

score 0 · Accepted Answer · edited May 23 '17 at 11:57

I know you asked for a regex, but here is a solution using something that won't summon Cthulhu.

Example:

$html = <<<'HTML'
<p>This is a <em>sample</em> text <a href="#">with</a>
 a link and an image <img src="" alt="" /> and so on</p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

foreach($xpath->query('//a | //link | //script | //img') as $node) {
    $node->parentNode->removeChild($node);
}

echo $dom->saveHTML();

Output:

<p>This is a <em>sample</em> text 
 a link and an image  and so on</p>

I recommend considering it as an option.
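
Since the stated goal is to extract the words, the same approach can be taken one step further. The following is a minimal sketch, not a definitive implementation: `extractWords` is a hypothetical helper name, and it assumes the input is a fragment with a single root element (or plain text) in an ASCII-compatible encoding.

```php
<?php
// Hypothetical helper: strips a, link, script and img elements,
// then returns the remaining words as a list.
function extractWords(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Copy the node list to an array before removing nodes from the tree.
    $unwanted = iterator_to_array($xpath->query('//a | //link | //script | //img'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    if ($dom->documentElement === null) {
        return [];
    }

    // textContent concatenates all text nodes that are left.
    preg_match_all('/\b\w+\b/u', $dom->documentElement->textContent, $matches);
    return $matches[0];
}

$html = <<<'HTML'
<p>This is a <em>sample</em> text <a href="#">with</a>
 a link and an image <img src="" alt="" /> and so on</p>
HTML;

print_r(extractWords($html));
```

For the sample above, this yields the words of the paragraph without "with", which sits inside the anchor.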
