I'm writing a function that extracts the src from the first image tag it finds in an HTML file. Following the instructions in another thread on this site, I got something that seemed to work:

preg_match_all('#<img[^>]*>#i', $content, $match); 

foreach ($match as $value) {
    $img = $value[0];
}

$stuff = simplexml_load_string($img);
$stuff = $stuff['src'];
return $stuff;

But after a few minutes of using the function, it started returning errors like this:

warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Premature end of data in tag img line 1 in path/to/script on line 42.

and

warning: simplexml_load_string() [function.simplexml-load-string]: tp://feeds.feedburner.com/~f/ChicagobusinesscomBreakingNews?i=KiStN" border="0"> in path/to/script on line 42.

I'm kind of new to PHP but it seems like my regex is chopping up the HTML incorrectly. How can I make it more "airtight"?

Community
  • I'm not sure what's up, but the debugger in me says: replace the simplexml_load_string() calls with echo $img. Also, it looks like you're overwriting $img with the last value every time you iterate in the foreach loop. Printing some debug statements might help clarify that, too. – ojrac Nov 28 '08 at 16:03
  • What HTML are you feeding in and what's the value of $img when the warning is thrown? – Greg Nov 28 '08 at 16:03
  • It would be very helpful to see the HTML you're passing into this. You might also consider printing out $img to make sure the regex pattern is doing its job before it goes into SimpleXML. – enobrev Nov 29 '08 at 10:57

4 Answers

These two lines of PHP code should give you a list of all the values of the src attribute in all img tags in an HTML file:

preg_match_all('/<img\s+[^<>]*src=["\']?([^"\'<>\s]+)["\']?/i', $content, $result, PREG_PATTERN_ORDER);
$result = $result[1];

To keep the regex simple, I'm not allowing file names to have spaces in them. If you want to allow this, you need to use separate alternatives for quoted attribute values (which can have spaces), and unquoted attribute values (which can't have spaces).
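For what it's worth, here is a rough sketch of what those separate alternatives could look like. The function name extract_img_srcs() is made up for illustration, and the pattern relies on PCRE's branch-reset group `(?|...)` so that whichever alternative matches ends up in group 1. Treat it as one possible variation, not a hardened solution.

```php
<?php
// Sketch only: extract_img_srcs() is a hypothetical helper name.
// Three alternatives: double-quoted, single-quoted, and unquoted values.
// Quoted values may contain spaces; unquoted values may not.
function extract_img_srcs(string $content): array
{
    // (?|...) is a branch-reset group: each alternative's capture is group 1.
    $pattern = '/<img\b[^<>]*?\ssrc\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^"\'<>\s]+))/i';
    preg_match_all($pattern, $content, $result, PREG_PATTERN_ORDER);

    // $result[1] holds the captured src value of every matched img tag.
    return $result[1];
}
```

Calling extract_img_srcs($content) returns an array of src values in document order, so the first image is element 0 if the array is not empty.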

Jan Goyvaerts

Most likely this happens because the "XML" being picked up by the regex isn't proper XML for whatever reason. I would probably go for a more complicated regex that pulls out the src attribute directly, instead of using SimpleXML to get the src. This regex might be close to what you need.

<img[^>]*?\ssrc\s*=\s*["']?([^"'>\s]*)["']?[^>]*>

You could also use a real HTML parsing library, but I'm not sure which options exist in PHP.
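As one possibility, PHP ships with the DOM extension, whose DOMDocument::loadHTML() is designed for HTML and does not require the markup to be well-formed XML. The sketch below assumes that extension is enabled; first_img_src() is a made-up helper name.

```php
<?php
// Sketch only: first_img_src() is a hypothetical helper name.
// Assumes PHP's bundled DOM extension is available.
function first_img_src(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects empty input
    }

    // Collect parser complaints about sloppy markup internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Return the src of the first img element that actually has one.
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->hasAttribute('src')) {
            return $img->getAttribute('src');
        }
    }
    return null; // no img with a src attribute found
}
```

This avoids handing a lone img tag to simplexml_load_string() at all, which is where the warnings in the question come from.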

Kibbee

An ampersand by itself in an attribute is invalid XML (it should be encoded as “&amp;”), but some people still write it that way in URLs on HTML pages (and all browsers support it). Maybe that is your problem.

If that is the case, you can sanitize your string before parsing it, replacing “&(?!amp;)” with “&amp;”.
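A minimal sketch of that substitution follows. The helper name escape_bare_ampersands() is made up, and the lookahead is widened slightly beyond the one suggested above so that existing named and numeric references are left alone. Note that this only addresses the ampersand: an img tag without a closing slash would still not be well-formed XML.

```php
<?php
// Sketch only: escape_bare_ampersands() is a hypothetical helper name.
function escape_bare_ampersands(string $markup): string
{
    // Replace "&" unless it already starts a reference such as &amp;, &#38; or &#x26;.
    $escaped = preg_replace(
        '/&(?!(?:[a-z][a-z0-9]*|#[0-9]+|#x[0-9a-f]+);)/i',
        '&amp;',
        $markup
    );

    // preg_replace() returns null on a regex error; fall back to the input then.
    return $escaped ?? $markup;
}
```

The result could then be passed to simplexml_load_string() in place of the raw tag.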

angus

On a different subject:

foreach ($match as $value) {
    $img = $value[0];
}

can be replaced with

$img = $match[count($match) - 1][0];

Or, to grab only the first image in the file, something like this:

if (preg_match('#<img\s[^>]*>#i', $content, $match)) {
    $img = $match[0]; //first image in file only
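    $stuff = simplexml_load_string($img);
    if ($stuff === false) {
        return null; //tag was not well-formed XML
    }
    return (string) $stuff['src']; //quote the key; a bare src is read as a constant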
} else {
    return null; //no match found
}

OIS
  • Hmm. That's returning a slightly different error: warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 1: parser error : Start tag expected, '<' not found in /path/to/script.php on line 43. –  Nov 28 '08 at 16:40