-4

I have this regex:

preg_match_all("/<\s*?img\s[^>]*?src=([\"']??)([^\"' >]*?)\1[^>]*?>/si", $content, $m);

The idea is to find all image links in a piece of HTML. Given this content:

<p>
    <img alt="" src="/emailimg/interdigital_old.jpg" style="width: 377px; height: 245px; " />Some text here.</p><a href="site.html">test</a>

after executing the regex, $m is an array with 3 empty arrays on it, but if I test it with this site it says the result is:

Array
(
    [0] => Array
        (
            [0] => <img alt="" src="/emailimg/interdigital_old.jpg" style="width: 377px; height: 245px; " />
        )

    [1] => Array
        (
            [0] => "
        )

    [2] => Array
        (
            [0] => /emailimg/interdigital_old.jpg
        )

)

What's the problem? Is it a configuration problem?

mario
  • 144,265
  • 20
  • 237
  • 291
Ivan
  • 14,692
  • 17
  • 59
  • 96
  • 5
    Obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - in all seriousness though, [DOM](http://php.net/manual/en/book.dom.php) and [XPath](http://www.php.net/manual/en/class.domxpath.php) would definitely be better for this job. – DaveRandom Jun 08 '12 at 12:26
  • Agreed, using XPath over DOM would be the preferred way to accomplish this. – Berry Langerak Jun 08 '12 at 12:28
  • Agree, but these are small html snippets, not a complete page, it's easy to parse it using regex – Ivan Jun 08 '12 at 12:29
  • [`DOMDocument::loadHTML()`](http://www.php.net/manual/en/domdocument.loadhtml.php) doesn't need a complete page ;-) – DaveRandom Jun 08 '12 at 12:33
  • When I say this I'm refering that these are very simple pieces of html, generated by a tool, that could be parsed by one line using regex, but needs (as you have showed in your answer) 4-5 lines to do "the same" (I know that in a general case that's not the same, but in my specific case, it is). – Ivan Jun 08 '12 at 14:09

1 Answers1

4

The DOM/XPath (i.e. correct) way:

<?php

  $html = '
<p>
    <img alt="" src="/emailimg/interdigital_old.jpg" style="width: 377px; height: 245px; " />Some text here.</p><a href="site.html">test</a>
';

  $dom = new DOMDocument('1.0');
  $dom->loadHTML($html);

  $xpath = new DOMXPath($dom);

  $links = array();
  foreach ($xpath->query('//img/@src') as $img) $links[] = $img->value;
  print_r($links);

Tested and working.

EDIT

The reason your regex doesn't work is twofold:

  1. You have declared your regex using a double-quoted string. This will often result in things you do not expect and are not entirely obvious, since a double quoted string will interpolate certain escape sequences itself, before it is passed to PCRE. The problem this was causing in your case is the \1 was being interpreted as an octal character definition (as defined here), so your expression had a literal 0x01 (start of heading) character in it, instead of the \1 string that you wanted PCRE to use as a back reference.

    I find that when I have a problem like this, a good place to start is to simply echo the expression to screen to see how PHP interpolated the string you declared in your script. Here is a demonstration of that particular problem.

  2. ([\"']??) - the second question mark is breaking it. I'm not actually sure what you were trying to accomplish with this, was it just a mis-type? I'm having a little difficulty working out exactly how PCRE is interpreting this and exactly why it breaks it, but suffice it to say that it does, and the second question mark needs to go. FTR, the effect that it has is that the expression still matches the <img> tag, but the following capture group (the data you actually want) is empty.

Now let's break down the regex and see how it can be improved:

  • <\s*?img - the non-greedy * here is pointless, since \s only matches white space and the next sequence will be alphas, simply <\s*img will suffice. I'm not actually certain that HTML tags are allowed to have leading whitespace between the opening < and the tag name, but I guess it doesn't do any harm to allow it, since proper parsers probably will.
  • \s[^>]*?src=(["']??) - as previously mentioned the ?? in the capture group is breaking the expression and I'm not certain what you were trying to do with it in the first place. Also, I think again the non-greedy * is pointless because the tag will end with >, and if we haven't found src by the end it's not a match anyway. Plus if we are allowing whitespace in places where it shouldn't be but parsers will probably allow, we should probably allow it around the =. I'd rewrite this to \s[^>]*src\s*=\s*(["']?).
  • ([^"' >]*?)\1 - Assuming that you are concerned about being able to handle unquoted attributes, no complaints here. Of course if you do know that attributes will always be quoted, you could simply use ([^\1]*?)\1 and drop the ? from the preceding capture group where we determined the quote type in use.
  • [^>]*?> - no complaints here.
  • /si - the s modifier is pointless because there are no .s anywhere in the expression. It's doesn't do any harm, but neither does it help, so it is superfluous.

So, putting all that together, here's how I would write the regex:

/<\s*img\s[^>]*src\s*=\s*(["']?)([^"' >]*?)\1[^>]*>/i

...which when converted to a PHP string declaration with the quotes properly escaped, looks like this:

$expr = '/<\s*img\s[^>]*src\s*=\s*(["\']?)([^"\' >]*?)\1[^>]*>/i';

...which works nicely, by the way.

Now, I still maintain that the DOM method is better even considering the extra code, since it will probably catch edge cases that my regex skillz have forgotten about. Although admittedly regex does seem to be somewhat faster.

DaveRandom
  • 87,921
  • 11
  • 154
  • 174
  • That doesn't respond to the question, but, anyway, it's the best answer, so I accept it – Ivan Jun 08 '12 at 14:12
  • Perfect, now I understand what happened, thank you. ([\"']??) is to try to get all possibilities of single, double and no -quotes, but there is an extra question mark (I was doing some tests in order to find out what the problem was) – Ivan Jun 08 '12 at 21:25