The DOM/XPath (i.e. correct) way:
<?php
$html = '
<p>
<img alt="" src="/emailimg/interdigital_old.jpg" style="width: 377px; height: 245px; " />Some text here.</p><a href="site.html">test</a>
';
$dom = new DOMDocument('1.0');
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = array();
foreach ($xpath->query('//img/@src') as $img) $links[] = $img->value;
print_r($links);
Tested and working.
EDIT
The reason your regex doesn't work is twofold:
You have declared your regex using a double-quoted string. This will often result in things you do not expect and are not entirely obvious, since a double quoted string will interpolate certain escape sequences itself, before it is passed to PCRE. The problem this was causing in your case is the \1
was being interpreted as an octal character definition (as defined here), so your expression had a literal 0x01
(start of heading) character in it, instead of the \1
string that you wanted PCRE to use as a back reference.
I find that when I have a problem like this, a good place to start is to simply echo
the expression to screen to see how PHP interpolated the string you declared in your script. Here is a demonstration of that particular problem.
([\"']??)
- the second question mark is breaking it. I'm not actually sure what you were trying to accomplish with this, was it just a mis-type? I'm having a little difficulty working out exactly how PCRE is interpreting this and exactly why it breaks it, but suffice it to say that it does, and the second question mark needs to go. FTR, the effect that it has is that the expression still matches the <img>
tag, but the following capture group (the data you actually want) is empty.
Now let's break down the regex and see how it can be improved:
<\s*?img
- the non-greedy *
here is pointless, since \s
only matches white space and the next sequence will be alphas, simply <\s*img
will suffice. I'm not actually certain that HTML tags are allowed to have leading whitespace between the opening <
and the tag name, but I guess it doesn't do any harm to allow it, since proper parsers probably will.
\s[^>]*?src=(["']??)
- as previously mentioned the ??
in the capture group is breaking the expression and I'm not certain what you were trying to do with it in the first place. Also, I think again the non-greedy *
is pointless because the tag will end with >
, and if we haven't found src
by the end it's not a match anyway. Plus if we are allowing whitespace in places where it shouldn't be but parsers will probably allow, we should probably allow it around the =
. I'd rewrite this to \s[^>]*src\s*=\s*(["']?)
.
([^"' >]*?)\1
- Assuming that you are concerned about being able to handle unquoted attributes, no complaints here. Of course if you do know that attributes will always be quoted, you could simply use ([^\1]*?)\1
and drop the ?
from the preceding capture group where we determined the quote type in use.
[^>]*?>
- no complaints here.
/si
- the s
modifier is pointless because there are no .
s anywhere in the expression. It's doesn't do any harm, but neither does it help, so it is superfluous.
So, putting all that together, here's how I would write the regex:
/<\s*img\s[^>]*src\s*=\s*(["']?)([^"' >]*?)\1[^>]*>/i
...which when converted to a PHP string declaration with the quotes properly escaped, looks like this:
$expr = '/<\s*img\s[^>]*src\s*=\s*(["\']?)([^"\' >]*?)\1[^>]*>/i';
...which works nicely, by the way.
Now, I still maintain that the DOM method is better even considering the extra code, since it will probably catch edge cases that my regex skillz have forgotten about. Although admittedly regex does seem to be somewhat faster.