How can I preg_match script tag src, but avoid affecting img tag src?

Question

I need to match local src attributes and rewrite them so they load from a remote host. Example:

src="/js/my.js">

Becomes:

src="http://cdn.example.com/js/my.js">

This is what I have now:

if (!preg_match("#<script(.+?) src=\"http#i",$page)){ 
$page = preg_replace("#<script(.+?) src=\"#is", "<script$1 src=\"$workingUrl", $page); 
}

It works fine when it encounters something like this:

<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>

It fails when it encounters something like this:

<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>

If the script tag doesn't contain a src, the match runs on past the tag, finds the src of the first image tag, and switches out that URL instead.

I need to know how to make the match terminate inside the script tag and/or how to perform the replacement better.

possible duplicate of [How to parse and process HTML/XML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml-with-php) — Quentin, Jan 10 '13 at 11:56
@BogdanBurim DOMDocument::loadHTML was created to attempt to make sense of non-wellformed HTML documents. — MatsLindh, Jan 10 '13 at 13:03
@Passerby DOMDocument::loadHTML may fail. Well written regex - will never. — Bogdan Burym, Jan 10 '13 at 13:47
@BogdanBurim "Well written" is a no-fail phrase. And speaking of malformed, ` — Passerby, Jan 11 '13 at 03:22

score 2 · Answer 1 · answered Jan 10 '13 at 13:15

If you would rather not use DOMDocument::loadHTML and work with the DOM, then as a fallback, drop the . and only accept everything up to the first >. That will probably work better, although it is not perfect: in theory, another attribute of <script> could contain a >.

Using:

#<script([^>]+?) src=\"#is

as your pattern instead makes the pattern stop matching when it encounters the first > after <script.
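
As a rough illustration of the difference, here is a minimal sketch that runs both patterns against the same input. The sample markup and the http://cdn.example.com value are assumptions made up for the demonstration; only the two patterns and the $workingUrl name come from the question and this answer.

```php
<?php
// Hypothetical sample page: an inline script without src, followed by an image.
$workingUrl = 'http://cdn.example.com';
$page = <<<HTML
<script language="JavaScript">
window.moveTo(0,0);
</script>
<img src="/images/logo.png">
<script type="text/javascript" src="/js/my.js"></script>
HTML;

// Original pattern: with the s flag, .+? may cross > and newlines, so the
// first <script> match keeps going until it reaches the src of the <img>.
$broken = preg_replace('#<script(.+?) src="#is', '<script$1 src="' . $workingUrl, $page);

// Suggested pattern: [^>]+? cannot cross the > that closes <script ...>,
// so a script tag without src simply fails to match.
$fixed = preg_replace('#<script([^>]+?) src="#is', '<script$1 src="' . $workingUrl, $page);

echo $broken, "\n\n", $fixed, "\n";
```

With the first pattern the img src is rewritten; with the second only the script src changes. Both patterns expect double quotes around the src value and at least one other attribute before src, so a tag such as <script src="/js/my.js"> would still be skipped.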

mickmackusa · Accepted Answer · 2021-03-11T23:10:13.773

Definitely use a DOM parser. XPath with DOMDocument will cleanly and reliably update the src of script tags that:

have a src attribute, and
have a src value that does not start with http.

I could have further developed the xpath query expression to check for the leading http substring, but I didn't want to scare you off with more syntax.

Code:

$html = <<<HTML
<html>
<head>
<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>
HTML;

$workingUrl = 'https://www.example.com';

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src]") as $node) {
    if (strpos($node->getAttribute('src'), 'http') !== 0) {
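        // Prepend the base URL to the existing relative path instead of discarding it.
        $node->setAttribute('src', $workingUrl . $node->getAttribute('src'));
    }
}
echo $dom->saveHTML();

Output:

<html>
<head>
<script type="text/javascript" src="https://www.example.com/wp-includes/js/jquery/jquery.js?ver=1.8.3"></script>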
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>

The only slightly "scarier" XPath version, which moves the http check into the query expression:

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src and not(starts-with(@src,'http'))]") as $node) {
    // Prepend the base URL to the existing relative path.
    $node->setAttribute('src', $workingUrl . $node->getAttribute('src'));
}
echo $dom->saveHTML();
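
If the replacement is needed in more than one place, the XPath approach could be wrapped in a small helper. The following is only a sketch under assumptions: the function name prefixScriptSrc, the sample markup, and the cdn.example.com URL are hypothetical, and the libxml_use_internal_errors() calls are an addition to keep parser warnings about malformed markup from being emitted.

```php
<?php
// Hypothetical helper: prefixes every <script src> that does not start with
// "http" with $baseUrl and returns the modified HTML.
function prefixScriptSrc(string $html, string $baseUrl): string
{
    // Collect libxml parse warnings internally instead of emitting them;
    // real-world HTML is frequently not well formed.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query("//script[@src and not(starts-with(@src,'http'))]") as $node) {
        // Keep the original relative path and put the base URL in front of it.
        $node->setAttribute('src', $baseUrl . $node->getAttribute('src'));
    }

    return $dom->saveHTML();
}

// Usage with made-up markup: the script src is prefixed, the img src is not.
$sample = '<html><head><script src="/js/my.js"></script></head>'
        . '<body><img src="/img/a.png"></body></html>';
$result = prefixScriptSrc($sample, 'http://cdn.example.com');
echo $result;
```

Because the query only selects script elements, img tags are never touched, regardless of attribute order or quoting style.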
