I need to match script tags whose src points to a local path and rewrite them so they load from a remote host. Example:

src="/js/my.js">

Becomes:

src="http://cdn.example.com/js/my.js">

This is what I have now:

if (!preg_match("#<script(.+?) src=\"http#i",$page)){ 
$page = preg_replace("#<script(.+?) src=\"#is", "<script$1 src=\"$workingUrl", $page); 
}

It works fine when it encounters something like this:

<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>

It fails when it encounters something like this:

<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>

If the script tag doesn't contain a src it will then find the src of the first image tag and switch out its URL.

I need to know how to get it to terminate the match on the script tag only and/or how to perform the replacement better.

V_RocKs
  • possible duplicate of [How to parse and process HTML/XML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml-with-php) – Quentin Jan 10 '13 at 11:56
  • Any reason why you don't use `SimpleXML` or `DOMDocument`? – Passerby Jan 10 '13 at 11:57
  • @Passerby Heard of non-valid HTML? – Bogdan Burym Jan 10 '13 at 12:15
  • @BogdanBurim DOMDocument::loadHTML was created to attempt to make sense of non-wellformed HTML documents. – MatsLindh Jan 10 '13 at 13:03
  • @Passerby DOMDocument::loadHTML may fail. Well written regex - will never. – Bogdan Burym Jan 10 '13 at 13:47
  • @BogdanBurim "Well written" is a no-fail phrase. And speaking of malformed, ` – Passerby Jan 11 '13 at 03:22

2 Answers

If you are not going to use DOMDocument::loadHTML and work with the DOM instead, a fallback is to drop the . and only accept characters up to the first >. That will probably work better, although it is not perfect: in theory, another attribute on <script> could itself contain a >.

Using:

#<script([^>]+?) src=\"#is

as your pattern instead makes the pattern stop matching when it encounters the first > after <script.
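
To see the difference, here is a minimal sketch that runs both patterns over the same markup. The sample HTML and the variable names $greedy and $scoped are made up for illustration; $workingUrl uses the CDN host from the question's example.

```php
<?php
// Sample markup (illustrative): an inline script with no src, then an image,
// then a script that does have a local src.
$page = <<<HTML
<script language="JavaScript">
window.moveTo(0,0);
</script>
<img src="/images/logo.png">
<script type="text/javascript" src="/js/my.js"></script>
HTML;

$workingUrl = 'http://cdn.example.com'; // assumed host, as in the question's example

// Original pattern: with the "s" flag, "." also matches ">" and newlines,
// so the match starting at the inline script runs on until the <img> src.
$greedy = preg_replace('#<script(.+?) src="#is', '<script$1 src="' . $workingUrl, $page);

// Restricted pattern: [^>] cannot cross the ">" that closes the opening tag,
// so a <script> without its own src is simply skipped.
$scoped = preg_replace('#<script([^>]+?) src="#is', '<script$1 src="' . $workingUrl, $page);

echo $greedy, "\n\n", $scoped, "\n";
```

With the first pattern the img src is rewritten as a side effect; with the second only the script that has its own src is touched. Both patterns still assume a double-quoted src.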

MatsLindh

Definitely use a DOM parser. XPath with DOMDocument will cleanly and reliably rewrite the script tags that:

  1. have a src attribute, and
  2. have a src value that does not start with http.

I could have further developed the XPath query expression to check for the leading http substring, but I didn't want to scare you off with more syntax.

Code:

$html = <<<HTML
<html>
<head>
<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>
HTML;

$workingUrl = 'https://www.example.com';

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src]") as $node) {
    if (strpos($node->getAttribute('src'), 'http') !== 0) {
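        // Prepend the host to the existing local path instead of overwriting it.
        $node->setAttribute('src', $workingUrl . $node->getAttribute('src'));
    }
}
echo $dom->saveHTML();

Output:

<html>
<head>
<script type="text/javascript" src="https://www.example.com/wp-includes/js/jquery/jquery.js?ver=1.8.3"></script>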
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>
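
Real pages are often not well-formed, which is the worry raised in the comments. As a rough sketch, the same approach can be wrapped in a function that collects libxml's parse warnings instead of printing them. The function name prefixLocalScriptSrc, the sample markup, and the extra check that skips protocol-relative URLs (//host/...) are assumptions for illustration, not part of the answer above.

```php
<?php
/**
 * Hypothetical helper: prefix every local <script src> with $baseUrl.
 * Assumes src values are root-relative paths such as "/js/my.js".
 */
function prefixLocalScriptSrc(string $html, string $baseUrl): string
{
    // Collect parse warnings internally so sloppy markup does not emit notices.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: sources starting with "http" or "//" are already remote.
    $query = "//script[@src and not(starts-with(@src,'http')) and not(starts-with(@src,'//'))]";
    foreach ($xpath->query($query) as $node) {
        // Join host and path with exactly one slash between them.
        $path = ltrim($node->getAttribute('src'), '/');
        $node->setAttribute('src', rtrim($baseUrl, '/') . '/' . $path);
    }

    return $dom->saveHTML();
}

$sample = '<div><script src="/js/my.js"></script><script>var a = 1;</script><img src="/logo.png"></div>';
echo prefixLocalScriptSrc($sample, 'http://cdn.example.com/');
```

Only the script with a local src is changed here; the inline script and the img are left alone.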

The only slightly "scarier" XPath version:

$dom = new DOMDocument; 
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src and not(starts-with(@src,'http'))]") as $node) {
    $node->setAttribute('src', $workingUrl . $node->getAttribute('src'));
}
echo $dom->saveHTML();

mickmackusa