PHP Regex Match Tag Lookahead Problems

Question

I am trying to check a webpage for the existence of a Google Analytics script tag. This seems like it should be easy, but my regex skills are lacking. As a simple example, I was trying to match the opening and closing script tags that have "google-analytics" between them.

So for example if you have:

<script src="whatever"></script>
<script>other script</script>
blah blah blah
<script>
   blah blah google-analytics
</script>

Then the regex:

/<script>([\s\S]*?google-analytics[\s\S]*?)<\/script>/

This will return a string starting at the first script tag and include the other script tags. So something like:

other script</script> blah blah blah <script> blah blah google-analytics

But of course I only want the string

blah blah google-analytics

So the next step is to include a negative lookahead like:

/<script>((?![\s\S]*?script)[\s\S]*?google-analytics[\s\S]*?)<\/script>/

But that doesn't seem to work. I tried a bunch of different combinations of capture groups, with the '[\s\S]*?' both in front of and behind the lookahead.

Basically, I am trying to match a string as long as it doesn't include a given substring. That sounds like a common problem, but for the life of me I can't get it to work. I have googled a ton, and all of the examples look straightforward but don't seem to work for me. I have been testing using https://regex101.com/r/hN5dK5/2

Any insight would be helpful. (script is running as php)

score 2 · Accepted Answer · edited May 23 '17 at 12:23

Regex method

First, use verbose mode (the x modifier) for better readability.
Then consider the following regex:

<script>                 # match "<script>" literally
(?:(?!</script>)[\s\S])* # match anything except "</script>"
(?:google-analytics)     # look for "google-analytics" literally
(?:(?!</script>)[\s\S])* # same pattern as above
</script>                # closing "</script>" tag

See this approach applied in an updated version of your demo.
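As a rough sketch of how this pattern might be wired into PHP: the version below uses "~" as the delimiter so that the "/" in "</script>" needs no escaping, and adds a capture group around the script body. The capture group, the variable names, and the sample page (the question's example with the final tag closed) are assumptions made for illustration, not part of the pattern above.

```php
<?php
// Sample page based on the question's example (assumed: last tag is closed).
$html = <<<'HTML'
<script src="whatever"></script>
<script>other script</script>
blah blah blah
<script>
   blah blah google-analytics
</script>
HTML;

// The "x" modifier turns on verbose mode: whitespace and "#" comments
// inside the pattern are ignored.
$pattern = '~
    <script>                    # opening tag without attributes
    (                           # capture the script body (added for this sketch)
      (?:(?!</script>)[\s\S])*  # any characters, never crossing "</script>"
      google-analytics          # the literal being searched for
      (?:(?!</script>)[\s\S])*  # remainder of the same script body
    )
    </script>                   # closing tag
~x';

$found = preg_match($pattern, $html, $m) === 1;
$body  = $found ? trim($m[1]) : null;

var_dump($found);      // bool(true)
echo $body, PHP_EOL;   // blah blah google-analytics
```

Because the opening tag is matched literally as "<script>", a tag with attributes would be skipped; replacing it with something like "<script[^>]*>" is one possible adjustment.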

Parser method(s)

SimpleXML

Parsing HTML with regular expressions is generally considered bad practice on SO (see this famous post), so you might prefer a parser with an appropriate XPath query:

// simplexml_load_string() returns false unless $html is well-formed XML
$xml = simplexml_load_string($html);
$scripts = $xml->xpath("//script[contains(text(),'google-analytics')]");
print_r($scripts);

See a demo on ideone.com.

DOMDocument

One could argue that SimpleXML was not really designed to parse HTML files (rather XML files, as the name suggests), so for the sake of completeness, here is an example with DOMDocument:

// collect warnings about malformed markup instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$scripts = $xpath->query("//script[contains(text(),'google-analytics')]");
foreach ($scripts as $script) {
    // do something useful here
    print_r($script);
}

For sure, when searching the DOM for tags, going with a DOM parser is usually the better path, although all of the PHP DOM parsers have side effects. For example, if you wanted to add the script tag when it was missing, all of the DOM parsers I found will alter the rest of the HTML. This is only a problem if you want your HTML to stay formatted for human readability. — Patrick_Finucane, Apr 23 '16 at 01:04

score 0 · Answer 2 · answered Apr 21 '16 at 18:40

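The problem is that the lookahead is not limited to the current script block: it scans all the way to the end of the page. Because a closing </script> always lies somewhere ahead, the remaining text always contains "script", so the negative lookahead rejects every script tag.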
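A small check on a trimmed copy of the question's sample may make this concrete. On this input the pattern finds no match at all, since even after the last opening tag the closing </script> still lies ahead and contains "script". The variable names and the sample string are assumptions made for illustration.

```php
<?php
// Trimmed copy of the question's sample (assumed: last tag is closed).
$html = <<<'HTML'
<script>other script</script>
blah blah blah
<script>
   blah blah google-analytics
</script>
HTML;

// The question's second pattern, with [\s\S] as the "match anything" class.
$withLookahead = '/<script>((?![\s\S]*?script)[\s\S]*?google-analytics[\s\S]*?)<\/script>/';
$matched = preg_match($withLookahead, $html);

// What the lookahead sees right after the LAST "<script>": everything up to
// the end of the input, which still includes the closing "</script>".
$tail = substr($html, strrpos($html, '<script>') + strlen('<script>'));
$tailHasScript = preg_match('/^[\s\S]*?script/', $tail) === 1;

var_dump($matched);        // int(0)
var_dump($tailHasScript);  // bool(true)
```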

A workaround I found was to limit the wildcard search to anything other than a '<', like:

/<script[^>]*>([^<]*?google-analytics\.com[\s\S]*?)<\/script>/

The part:

[^<]*?

Matches any character that is not a '<'. That makes sure there aren't any other tags between the 'script' tag and the google string.
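The sketch below tries this workaround on two made-up inputs, with the dot in "google-analytics.com" escaped so it only matches a literal dot. It also hints at a limitation worth keeping in mind: because '[^<]' stops at any '<', a script whose code contains a less-than sign before the google string will not match. Both sample pages and all variable names are assumptions made for illustration.

```php
<?php
$pattern = '/<script[^>]*>([^<]*?google-analytics\.com[\s\S]*?)<\/script>/';

// Hypothetical page: two unrelated script tags, then the analytics loader.
$page = <<<'HTML'
<script src="whatever"></script>
<script>other script</script>
<script>
   load("https://www.google-analytics.com/analytics.js");
</script>
HTML;

// Hypothetical page where a "<" appears in the code before the google string.
$pageWithLessThan = <<<'HTML'
<script>
   if (a < b) { load("https://www.google-analytics.com/analytics.js"); }
</script>
HTML;

$hit  = preg_match($pattern, $page, $m);
$miss = preg_match($pattern, $pageWithLessThan);

var_dump($hit);           // int(1)
echo trim($m[1]), PHP_EOL; // load("https://www.google-analytics.com/analytics.js");
var_dump($miss);          // int(0)
```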
