Regular Expressions + preg_match_all - Getting the value of an attribute

Question

I'm trying to get the value of the href attribute of the first <a> tag in a post that links to an image.
This is what I have so far:

$pattern = "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_link = $matches[1][0];

However, this does not work.

I have code that gets the src value of an <img> tag, and it does work:

$pattern = "/<img.+src=[\'"]([^\'"]+)[\'"].*>/i";
$output = preg_match_all($pattern, $post->post_content, $matches);
$first_img = $matches[1][0];

As I'm no expert with regular expressions, or PHP in general, I have no idea what I'm doing wrong.

Also I couldn't find any decent, organized guide to regular expressions so a link to one could be useful as well!

[Here is the link you asked for](http://www.regular-expressions.info/tutorial.html). If you read through this tutorial your grasp on regular expressions will greatly increase. [But here is why you should rethink your overall approach](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). And it might help if you showed us your input. Finally, your second example cannot possibly work because you use `"` for your string, but escape the `'` inside it. — Martin Ender, Dec 08 '12 at 16:55
Thanks for the link - I'll give it a read. As for why I should rethink my approach - nice answer, it was quite funny. Unfortunately, this is the only way I know. Please let me know if there's any other way to do it more efficiently. As for the second example; originally the `$pattern` in the second example was directly embedded within the `$output` line and I just moved it up for easier comparison between the examples. This also might be the reason for the first example malfunction, am I right? — Asaf, Dec 08 '12 at 17:08
Yes, the two answers you have now are basically the two options you have. If you can use the library I linked, the code becomes a lot cleaner and easier to read. If not, GoogleGuy's approach is the way to go. — Martin Ender, Dec 08 '12 at 17:12

score 3 · Answer 1 · answered Dec 08 '12 at 17:09

This isn't a problem you should be solving with regular expressions. If you want to parse HTML, what you need is an HTML parser and PHP already has one for you that works great!

$html = <<<HTML
<a href="http://somesillyexample.com/some/silly/path/to/a/file.jpeg">
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html); // load HTML from a string
$elements = $dom->getElementsByTagName('a'); // get all elements with an 'a' tag in the DOM
foreach ($elements as $node) {
    /* If the element has an href attribute let's get it */
    if ($node->hasAttribute('href')) {
        echo $node->getAttribute('href') . "\n";
    }
}
/*
will output:

http://somesillyexample.com/some/silly/path/to/a/file.jpeg
*/

See the DOMDocument documentation for more details.
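
The loop above prints every link, while the question asks for the first link that points to an image. A minimal sketch of that filter follows, assuming PHP 7.1 or later. The sample markup and the helper name `first_image_link()` are made up for illustration; in the asker's code the input would be `$post->post_content`.

```php
<?php
// Hypothetical sample input standing in for $post->post_content.
$content = '<p><a href="http://example.com/about">About</a> '
         . '<a href="http://example.com/uploads/photo.JPG">'
         . '<img src="http://example.com/uploads/photo-150x150.jpg"></a></p>';

/**
 * Returns the href of the first <a> whose URL path ends in an image
 * extension, or null if there is none.
 * Illustrative helper, not part of any library.
 */
function first_image_link(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world post markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $node) {
        // getAttribute() returns an empty string when href is missing.
        $href = $node->getAttribute('href');
        // Test only the path so a query string cannot hide the extension.
        $path = (string) parse_url($href, PHP_URL_PATH);
        if (preg_match('/\.(bmp|gif|jpe?g|png)$/i', $path)) {
            return $href;
        }
    }
    return null;
}

$first_link = first_image_link($content);
echo $first_link, "\n";
```

The regular expression here only checks a file extension on an already extracted URL, which is a much smaller job than parsing the HTML itself.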

score 2 · Accepted Answer · answered Dec 08 '12 at 17:06


You should use a DOM parser for this. If you can use third-party libraries, check out PHP Simple HTML DOM Parser (it provides the `simple_html_dom` class used here). It makes your task incredibly easy:

$html = new simple_html_dom();
$html->load($post->post_content);

$anchor = $html->find('a', 0);
$first_link = $anchor->href;

If you cannot use this library for one reason or another, using PHP's built-in DOM module is still a better option than regular expressions.

answered Dec 08 '12 at 17:06

Martin Ender


Thanks. I guess I will use your advice. The only problem I see with it is that this applies to **all** links. The first link could just as well be a text link and not an image link... – Asaf Dec 08 '12 at 17:12
By the way, my logic in doing it is that WordPress resizes images and saves them as new, smaller copies of the originals. Because I want to get the first image of a post and use it as a thumbnail, the `<img>` tag approach could cause thumbnails to come out blurry. That's why I wanted to use the `<a>` tags that link to the original images. – Asaf Dec 08 '12 at 17:16

Nevermind! I just read the documentation. `[attribute*=value]` is my answer :) – Asaf Dec 08 '12 at 17:18
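
For readers who cannot install the third-party library, a similar substring test on an attribute can be sketched with PHP's built-in DOMXPath, whose `contains()` function plays roughly the role of the `[attribute*=value]` selector. This is only an analogue under stated assumptions: the sample markup is hypothetical, and XPath 1.0 string matching is case-sensitive, so `.JPG` would not match here.

```php
<?php
// Hypothetical sample input standing in for $post->post_content.
$content = '<a href="http://example.com/about">About</a>'
         . '<a href="http://example.com/uploads/photo.jpg">'
         . '<img src="http://example.com/uploads/photo-150x150.jpg"></a>';

$dom = new DOMDocument();
// Keep libxml warnings about imperfect markup out of the output.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// contains() is a plain substring test on the href value.
$nodes = $xpath->query(
    "//a[contains(@href, '.jpg') or contains(@href, '.png')]"
);

// Results come back in document order, so item(0) is the first match.
$first_link = $nodes->length > 0
    ? $nodes->item(0)->getAttribute('href')
    : null;

echo $first_link, "\n";
```

Because `contains()` matches anywhere in the value, a URL such as `page.jpg.html` would also pass; checking the extension in PHP after extracting the href is stricter.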

score 1 · Answer 3 · answered Dec 08 '12 at 17:15

Just some notes about your regular expression:

 "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
      ^ that's greedy, should be +?
     ^ that's any char, should be not-closing-tag character: [^>]

 "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
            ^^^^^^ for readability use ['\"]

 "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
                       ^ that's any char, you probably wanted \.

 "/<a.+href=('|\")(.*?).(bmp|gif|jpeg|jpg|png)('|\").*>/i"
                    ^^ that's ungreedy (good!)       ^ see above (greedy any char)

I can't test it now as I don't have PHP here, but correct these issues and maybe your problem is already solved. Also check the pattern modifier U (written after the closing delimiter, as in /.../U), which inverts the default greediness of the quantifiers.

This problem, however, has been solved many times, so you should use the existing solutions (a DOM parser). For example, you're not permitting quotes in the href (which is probably OK for href, but later you'll copy and paste your regex to parse another HTML attribute where quotes are valid characters).
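
Applying the notes above, one possible corrected pattern might look like the sketch below. It is an illustration, not a tested drop-in fix: the sample markup is hypothetical, `preg_match` is swapped in for `preg_match_all` because only the first match is needed, and `~` is used as the delimiter. Note also that in the original pattern group 1 captures the quote character, so `$matches[1][0]` could never hold the URL.

```php
<?php
// Hypothetical sample input standing in for $post->post_content.
$content = '<a class="x" href="http://example.com/about">About</a>'
         . '<a href=\'http://example.com/uploads/photo.png\' title="t">pic</a>';

// <a\b            the tag name "a" only (not <abbr> etc.)
// [^>]*?\shref=   lazily skip other attributes without leaving the tag
// ([\'"])         group 1: the opening quote
// ([^\'">]+?\.(?:bmp|gif|jpe?g|png))
//                 group 2: the URL, ending in a literal dot + extension
// \1              the same quote character that opened the value
$pattern = '~<a\b[^>]*?\shref=([\'"])([^\'">]+?\.(?:bmp|gif|jpe?g|png))\1[^>]*>~i';

if (preg_match($pattern, $content, $matches)) {
    $first_link = $matches[2]; // group 2 holds the full URL
} else {
    $first_link = null;
}

echo $first_link, "\n";
```

This still shares the limits of any regex approach to HTML (unquoted attribute values, URLs with query strings, and commented-out markup are not handled), which is why the DOM parser remains the safer choice.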
