0

I have some markup which I need to crawl for images, checking whether each image exists at the path it specifies. If an image does not exist in location A, its path should be replaced with location B.

I'm wondering what would be the most efficient way of achieving this?

crappish
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Nov 17 '10 at 15:12
  • is it your markup or why do you need to do it that way? Why can't you check if the images exist BEFORE generating the markup? – Simon Nov 17 '10 at 15:12
  • Because the markup isn't generated by PHP, it's hand written. The markup is grabbed by PHP and then run in another location, so the image paths are different. The reason I don't want to use absolute paths or just change every path is that I want to retain simple "overloading" of the images on a per-application basis; if the application doesn't have picture X, then redirect the path to the common location, where all the default pictures are. – crappish Nov 18 '10 at 11:42

2 Answers

2

Use PHP's SimpleXML. It is quite easy to use. Here's an example (untested, and it assumes the file is well-formed XML, but you get the idea):

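<?php

// simplexml_load_file() needs well-formed XML (e.g. XHTML), not arbitrary HTML.
$document = simplexml_load_file('dah_file.html');

// Visit every <img> element at any depth, not just the root's direct children.
foreach ($document->xpath('//img') as $img)
{
  // Attributes are SimpleXMLElement objects, so cast to string for file_exists().
  if (!file_exists((string) $img['src']))
  {
    $img['src'] = 'path/to/image.png';
  }
}

print($document->asXML());

?>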
Blender
  • And he can use file_get_contents($image_url) to check if the file exists. – Matt Williamson Nov 17 '10 at 15:31
  • Thanks! I completely forgot about DOM. I use that a lot more than SimpleXML, but I fell away from coding PHP for a few months. I wish they would've chosen a catchier name for the best DOM parser... – Blender Nov 17 '10 at 15:48
  • Ahh. The DOMDocument seems like it's exactly what I need! Except that the DOMDocument stuffs DOCTYPE, html and body tags there, when I'm loading just document fragments... Markup is then JSON encoded into the document and transferred to the requesting client, in many cases this doubles the markup payload (when the markup is really simple), which isn't that brilliant, especially as the app is targeted for mobile use, where every byte counts. – crappish Nov 18 '10 at 13:47
  • Solved it with simple hack of adding extra DIV wrapper and just taking the contents of that DIV. – crappish Nov 18 '10 at 14:48
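
For reference, the DOMDocument approach with the wrapper DIV mentioned in the comments could look roughly like the sketch below. The function name `rewriteImagePaths()` and the parameters `$appDir` and `$fallbackPrefix` are made up for illustration, and it assumes that each `src` is a path relative to `$appDir` on disk and that the default image has the same file name under the common location.

```php
<?php
// Sketch only: rewriteImagePaths(), $appDir and $fallbackPrefix are hypothetical names.
function rewriteImagePaths(string $fragment, string $appDir, string $fallbackPrefix): string
{
    $doc = new DOMDocument();

    // Hand-written fragments are rarely valid documents, so collect parser
    // warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The extra <div> is the wrapper trick: loadHTML() adds DOCTYPE, <html>
    // and <body>, and the wrapper lets us find our own markup again.
    $doc->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    foreach ($wrapper->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Assumption: src values are paths relative to $appDir on disk.
        if ($src !== '' && !is_file($appDir . '/' . $src)) {
            $img->setAttribute('src', $fallbackPrefix . '/' . basename($src));
        }
    }

    // Serialize only the wrapper's children, dropping everything loadHTML() added.
    $html = '';
    foreach ($wrapper->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}
```

A call might look like `rewriteImagePaths($markup, '/var/www/app/images', '/common/images')`, where both paths are placeholders. Without an encoding hint, `loadHTML()` may mangle non-ASCII text, so fragments containing such characters probably need extra care.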
1

You could use regular expressions here. Write a regular expression that matches the src attribute of <img> tags and pass it to the preg_replace_callback() function.

Something like this (more or less pseudo code):

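$htmlContent = preg_replace_callback(
    // [^"]* stops at the closing quote; a greedy .* would swallow later attributes.
    '/<img src="([^"]*)"/i',
    function ($matches) {
        // The return value replaces the whole match, so return the full
        // '<img src="..."' text rather than only the path.
        if (ImageExists($matches[1]))
            return $matches[0];
        else
            return '<img src="/path/to/some/other/image.jpg"';
    },
    $htmlContent
);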

You'll have to provide the ImageExists() function, of course, and a correct regex.
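
One possible shape for `ImageExists()` is sketched below. The second parameter `$root` is not part of the answer above; it is an assumed base directory, on the guess that the `src` values are local paths rather than remote URLs.

```php
<?php
// Sketch only: $root is a hypothetical base directory for resolving src values.
function ImageExists(string $src, string $root = __DIR__): bool
{
    // Keep only the path part, dropping any query string such as "?v=2".
    $path = parse_url($src, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return false;
    }

    // Resolve the (URL-decoded) path against the base directory.
    $full = rtrim($root, '/') . '/' . ltrim(rawurldecode($path), '/');

    // is_file() is false for directories, unlike file_exists().
    return is_file($full);
}
```

Because `$root` has a default, the one-argument call in the callback above still works; it then checks relative to the directory of the file that defines the function.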

Jan Hančič
  • This is unreliable. What if the images are not written exactly like in your pattern? – Gordon Nov 17 '10 at 15:40
  • I also agree: what happens if there is an `alt=` attribute before the `src=`? You would have to include that too... This *would* work though if the HTML markup is consistent throughout all of the documents – Blender Nov 17 '10 at 16:00
  • This sounds like it would be just what I need. Awsum! However, is there a way to get the pattern to catch the src even when there are alt, id etc. attributes before the src? Sadly, I'm rather useless myself when it comes to regular expressions.. :/ – crappish Nov 18 '10 at 11:45
  • @all : I said you'd have to roll your own regular expression. That was just an example :) – Jan Hančič Nov 18 '10 at 12:35
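
On the question of catching `src` when other attributes come first, a pattern along the following lines might do. This is a sketch, not a robust HTML parser: `replaceMissingImages()`, `$exists` and `$fallback` are hypothetical names, and the pattern can still be fooled, for example by a `>` inside an attribute value.

```php
<?php
// Sketch only: replaceMissingImages(), $exists and $fallback are hypothetical names.
function replaceMissingImages(string $html, callable $exists, string $fallback): string
{
    // Group 1: "<img", any attributes, then "src=" preceded by whitespace
    //          (so "data-src" is not mistaken for "src").
    // Group 2: the opening quote, single or double.
    // Group 3: the path, up to the matching closing quote.
    $pattern = '/(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2/i';

    $result = preg_replace_callback(
        $pattern,
        function ($m) use ($exists, $fallback) {
            $src = $exists($m[3]) ? $m[3] : $fallback;
            // Rebuild the matched text with the original quoting style.
            return $m[1] . $m[2] . $src . $m[2];
        },
        $html
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}
```

Passing the existence check as a callable keeps the sketch independent of how paths map to disk; something like `replaceMissingImages($htmlContent, 'file_exists', '/path/to/some/other/image.jpg')` would mirror the answer above.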