OK, admittedly I am not the best at working with regular expressions. I am doing a screen scrape and then trying to fix the img src values of the embedded images so that they point back to the original domain. This is the regex I have been trying variations of (too many to list; here is the current one):

preg_match_all('/<img\b[^>]*>/i', $html, $images);  

What this ends up doing is replacing every < with />. What I need it to do is simply return the (currently) five images on the page in an array, so that I can fix their src values and then write them back to $html, which is set at the beginning of the file:

$html = file_get_contents($target_url);
Andy Lester
CSmith
  • It seems like you're just trying to get the src attribute. Will DOMDocument or even SimpleXML not do? – Explosion Pills Feb 22 '11 at 22:44
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Brad Feb 22 '11 at 22:44

1 Answer

Basically, don't do this with regex. You can parse HTML with regex, but it is almost certainly not worth the effort.

Do it with genuine DOM parsing instead, using the DOMDocument class:

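$dom = new DOMDocument;
// Scraped HTML is rarely valid; collect libxml parse warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    // Prefix each src with the original domain (assumes every src is a relative path).
    $image->setAttribute('src', 'http://example.com/' . $image->getAttribute('src'));
}
$html = $dom->saveHTML();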
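
The loop above prefixes every src, so an image that already has an absolute URL would end up with a broken one. A minimal sketch of one way to guard against that is shown here. The helper name absolutizeImageSources and its $baseUrl parameter are hypothetical and not part of the original answer. It also assumes that every relative path should resolve against the site root, which may not hold for page-relative paths such as ../img/a.png.

```php
<?php
// Hypothetical helper (not from the original answer): prefix only relative src values.
function absolutizeImageSources(string $html, string $baseUrl): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // scraped markup is often invalid
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        // Leave empty values, URLs with a scheme (http:, https:, data:, ...)
        // and protocol-relative URLs (//host/...) untouched.
        if ($src === '' || preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $src)) {
            continue;
        }
        // Assumption: relative paths are resolved against the site root.
        $image->setAttribute('src', rtrim($baseUrl, '/') . '/' . ltrim($src, '/'));
    }

    return $dom->saveHTML();
}

$html = '<p><img src="/images/a.png"><img src="http://cdn.example.org/b.png"></p>';
echo absolutizeImageSources($html, 'http://example.com');
```

Note that saveHTML() returns a full document, so the output is wrapped in a doctype, html and body element even when the input was only a fragment.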
lonesomeday