I'm trying to use a regex to rewrite the src attribute (on an image or any other tag) in PHP.

I have a string like this:

$string2 = "<html><body><img src = 'images/test.jpg' /><img src = 'http://test.com/images/test3.jpg'/><video controls=\"controls\" src='../videos/movie.ogg'></video></body></html>";

And I would like to turn it into:

$string2 = "<html><body><img src = 'test.jpg' /><img src = 'test3.jpg'/><video controls=\"controls\" src='movie.ogg'></video></body></html>";

Here's what I tried:

$string2 = preg_replace("/src=["']([/])(.*)?["'] /", "'src=' . convert_url('$1') . ')'" , $string2);
echo htmlentities ($string2);

It didn't change anything and gave me a warning about an unescaped string.

Doesn't $1 pass the captured content to the function? What is wrong here?

The convert_url function comes from an example I posted here before:

function convert_url($url)
{
    if (preg_match('#^https?://#', $url)) {
        $url = parse_url($url, PHP_URL_PATH);
    }
    return basename($url);
}

It's supposed to strip out URL paths and return just the filename.
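As a quick check that the helper itself is not the problem, here is a minimal sketch that runs convert_url (copied from the question) over the three src values in the sample string. The `$samples` array is only an illustration and not part of the original code.

```php
<?php
// convert_url() as posted in the question.
function convert_url($url)
{
    if (preg_match('#^https?://#', $url)) {
        $url = parse_url($url, PHP_URL_PATH);
    }
    return basename($url);
}

// The three src values from the sample string (illustrative array).
$samples = array(
    'images/test.jpg',
    'http://test.com/images/test3.jpg',
    '../videos/movie.ogg',
);

foreach ($samples as $src) {
    echo $src, ' => ', convert_url($src), "\n";
}
// images/test.jpg => test.jpg
// http://test.com/images/test3.jpg => test3.jpg
// ../videos/movie.ogg => movie.ogg
```

Since each value comes back as a bare filename, the missing replacement has to come from the preg_replace call rather than from the helper.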

Sampson
Ashesh
  • the original string and what you want to turn it into are both empty strings -- is something missing? – ametren May 18 '12 at 19:04
  • You really shouldn't parse HTML with regex. You should find a pretty comprehensive answer as to why if you search SO. In the meantime, may I suggest DOM or SimpleXML – GordonM May 18 '12 at 20:34
  • I mean: try replacing every " in the regex with \", except the first and the last – Alexandre Khoury May 19 '12 at 04:02
  • possible duplicate of [Grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/grabbing-the-href-attribute-of-an-a-element) – Gordon May 23 '12 at 06:08
  • Also, if you want to use regex and want to use a function in the replacement, you need `preg_replace_callback`. You cannot do `convert_url('$1')` like you do because that is evaluated before $1 exists. – Gordon May 23 '12 at 06:12

3 Answers

Don't use regular expressions on HTML - use the DOMDocument class.

$html = "<html>
           <body>
             <img src='images/test.jpg' />
             <img src='http://test.com/images/test3.jpg'/>
             <video controls='controls' src='../videos/movie.ogg'></video>
           </body>
         </html>";

$dom = new DOMDocument;  
libxml_use_internal_errors(true);

$dom->loadHTML( $html ); 
$xpath = new DOMXPath( $dom );
libxml_clear_errors();

$doc = $dom->getElementsByTagName("html")->item(0);
$src = $xpath->query(".//@src");

foreach ( $src as $s ) {
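  // array_pop() takes its argument by reference, so explode()'s result needs a variable first
  $parts = explode( "/", $s->nodeValue );
  $s->nodeValue = array_pop( $parts );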
}

$output = $dom->saveXML( $doc );

echo $output;

Which outputs the following:

<html>
  <body>
    <img src="test.jpg">
    <img src="test3.jpg">
    <video controls="controls" src="movie.ogg"></video>
  </body>
</html>
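The same XPath loop can also reuse the asker's convert_url, which additionally drops a query string from absolute http(s) URLs. The following is a sketch under assumptions rather than a definitive version: the `?v=2` suffix on the second image is a made-up input added for illustration, and passing a node to `saveHTML()` assumes PHP 5.3.6 or later.

```php
<?php
// convert_url() as posted in the question.
function convert_url($url)
{
    if (preg_match('#^https?://#', $url)) {
        $url = parse_url($url, PHP_URL_PATH);
    }
    return basename($url);
}

// Sample markup; the "?v=2" query string is a hypothetical addition.
$html = "<html><body><img src='images/test.jpg' />"
      . "<img src='http://test.com/images/test3.jpg?v=2'/>"
      . "<video controls='controls' src='../videos/movie.ogg'></video></body></html>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings (e.g. about <video>) instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

// Rewrite every src attribute in the document, whatever element it sits on.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@src') as $attr) {
    $attr->value = convert_url($attr->value);
}

// Serialize only the <html> element, without the DOCTYPE that loadHTML() adds.
$out = $dom->saveHTML($dom->documentElement);
echo $out, "\n";
```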
Sampson
  • The dom document class is not very helpful if it is html embedded inside another HTML tag like for e.g. – Ashesh May 18 '12 at 19:08
  • @Ashesh I'm not sure I follow. You showed us PHP code - I'm showing you the solution. – Sampson May 18 '12 at 19:11
  • Well, I'm sorry, I should have been clearer. Here's what I'm talking about: "". In this case, DOMDocument would not pick up on the image tag inside the JavaScript. That's why I need to use regex. – Ashesh May 18 '12 at 19:13
  • @Ashesh The code above will work on the PHP string you have provided here. It converts the `src` attributes to point only to the filename. – Sampson May 18 '12 at 19:29
  • Sometimes it's not a good idea to load an HTML parser, especially for short predefined text values (e.g. smth), where only src="" and alt="" could vary. – BasTaller Jun 19 '12 at 14:40

You have to use the e modifier.

$string = "<html><body><img src='images/test.jpg' /><img src='http://test.com/images/test3.jpg'/><video controls=\"controls\" src='../videos/movie.ogg'></video></body></html>";

$string2 = preg_replace("~src=[']([^']+)[']~e", '"src=\'" . convert_url("$1") . "\'"', $string);

Note that when using the e modifier, the replacement must be passed as a string containing PHP code, so that it is evaluated for each match rather than before the call to preg_replace. The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0.
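Gordon's comment under the question suggests preg_replace_callback for calling a function on each match. A minimal sketch of that route follows, assuming PHP 5.3 or later for the closure. The pattern is my own illustration, not taken from any of the posts: it tolerates the spaces around `=` in the question's markup and accepts either quote style.

```php
<?php
// convert_url() as posted in the question.
function convert_url($url)
{
    if (preg_match('#^https?://#', $url)) {
        $url = parse_url($url, PHP_URL_PATH);
    }
    return basename($url);
}

$string = "<html><body><img src = 'images/test.jpg' /><img src = 'http://test.com/images/test3.jpg'/><video controls=\"controls\" src='../videos/movie.ogg'></video></body></html>";

// (["']) captures the opening quote and \1 requires the same closing quote;
// (.*?) lazily captures the attribute value in between.
$pattern = '~\bsrc\s*=\s*(["\'])(.*?)\1~i';

$result = preg_replace_callback($pattern, function ($m) {
    // $m[1] is the quote character, $m[2] the original src value.
    return 'src=' . $m[1] . convert_url($m[2]) . $m[1];
}, $string);

echo $result, "\n";
// <html><body><img src='test.jpg' /><img src='test3.jpg'/><video controls="controls" src='movie.ogg'></video></body></html>
```

The callback receives the match array at replacement time, so convert_url sees the real attribute value instead of the literal text `$1`. As a side effect, this sketch normalizes `src = '...'` to `src='...'`.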

ilanco
function replace_img_src($img_tag) {
    $doc = new DOMDocument();
    $doc->loadHTML($img_tag);
    $tags = $doc->getElementsByTagName('img');
    foreach ($tags as $tag) {
        $old_src = $tag->getAttribute('src');
        $new_src_url = 'website.com/assets/'.$old_src;
        $tag->setAttribute('src', $new_src_url);
    }
    return $doc->saveHTML();
}
Abdo-Host