0

I have the following code that returns the first image of a post:

$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', 
               $post->post_content, $matches);

However, it returns any image. I need to ignore images in GIF format. How can I add this condition to the regular expression?

Andy Lester
  • You can't parse [X]HTML with regex, because [Zalgo is coming](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)! – Oriol Oct 09 '13 at 02:15
  • @Oriol There is a difference between "parsing html" and getting the content of an attribute. With regex the point of failure would be if an image tag was contained in a comment (could be acceptable), but on the other end a proper parsing solution will fail if the HTML isn't valid so the "proper solution" is also not perfect. Can't parse arbitrary HTML with regex, but for simple operations on a known format, Regex is quite workable. PS: Voting down the first question of a new user who explained clearly his problem and tried to solve it??? – Sylverdrag Oct 09 '13 at 03:27
  • @Sylverdrag Yes, but asker didn't say that the html string is trusted and has always the same format, so I only wanted to warn. And the downvote is not mine. – Oriol Oct 09 '13 at 17:08
  • @Oriol The asker did not say that, but he is going through the result of $post->post_content. I think it's a safe assumption that the source HTML is the HTML content of his Wordpress page. For the downvote, sorry, I suspected it wasn't you, but I was too lazy to write that observation in a separate comment. – Sylverdrag Oct 10 '13 at 05:07

3 Answers

1

It is easier to loop through the results and filter them with a second regex.

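$output = preg_match_all('/<img[^>]+?src=[\'"](.+?)[\'"].*?>/i', $post->post_content, $matches);
$noGif = array();
// $matches[1] holds the captured src values ($matches[0] holds the full tags)
foreach ($matches[1] as $imgSrc)
{
    // keep the image unless its src ends in .gif
    if (!preg_match('/\.gif$/i', $imgSrc))
    {
        $noGif[] = $imgSrc;
    }
}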

It is easier to understand, and there won't be unexpected side effects like blocking valid pictures that happen to have the letters "gif" in the file name.

Note: be very careful when using .+ and .*. As it stands, your regex matches a LOT more than you think:

Try it on this, for instance:

<img whatever> whatever <img src="mypic.png"> <some other tag>
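
To make the over-matching concrete, here is a minimal sketch that runs the question's pattern and a tightened variant against the sample string above. The variable names ($loose, $tight) are just illustrative, and the tightened pattern is one possible fix, not the only one.

```php
<?php
// Sample string from the answer above.
$html = '<img whatever> whatever <img src="mypic.png"> <some other tag>';

// Pattern from the question: greedy .+ and .* can run across tag boundaries.
$loose = '/<img.+src=[\'"]([^\'"]+)[\'"].*>/i';
// Tightened variant: [^>] keeps each part of the match inside a single tag.
$tight = '/<img[^>]+src=[\'"]([^\'"]+)[\'"][^>]*>/i';

preg_match($loose, $html, $looseMatch);
preg_match($tight, $html, $tightMatch);

echo $looseMatch[0], "\n"; // the full match spans the entire sample string
echo $tightMatch[0], "\n"; // the full match is only <img src="mypic.png">
```

Both patterns capture mypic.png here, but the loose one swallows three tags to get there, which matters as soon as the full match is used or a second image follows.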
Sylverdrag
1

You should probably not be using regular expressions

  • HTML is not regular
  • Regexes may match today, but what about tomorrow?

Say you've got a file of HTML where you're trying to extract URLs from tags.

<img src="http://example.com/whatever.jpg">

So you write a regex like this (in Perl):

if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}

In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:

<img src='http://example.com/whatever.jpg'>

or

<img src=http://example.com/whatever.jpg>

or

<img border=0 src="http://example.com/whatever.jpg">

or

<img
    src="http://example.com/whatever.jpg">

or you start getting false positives from

<!-- <img src="http://example.com/outdated.png"> -->
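
For the PHP case in the question, a parser-based approach might look like the sketch below. It assumes the dom extension (DOMDocument) is available, which it usually is but not always. The function name firstNonGifImage is hypothetical, as is the spinner.gif URL in the sample.

```php
<?php
// Hypothetical helper: returns the src of the first <img> that is not a GIF,
// or null when there is none.
function firstNonGifImage(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Look only at the extension of the URL path, so query strings are ignored.
        $path = (string) parse_url($src, PHP_URL_PATH);
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if ($src !== '' && $ext !== 'gif') {
            return $src;
        }
    }
    return null;
}

// Mixes the awkward cases listed above: a commented-out tag, single quotes,
// an extra attribute, and a tag split across lines.
$html = <<<'HTML'
<!-- <img src="http://example.com/outdated.png"> -->
<img border=0 src='http://example.com/spinner.gif'>
<img
    src="http://example.com/whatever.jpg">
HTML;

echo firstNonGifImage($html), "\n";
```

For this sample the helper should return http://example.com/whatever.jpg: the commented-out image is never seen as an element, and the GIF is skipped by its extension.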
Andy Lester
0
<img[^>]+src=[\'"]((?:[^\'"](?!\.gif))+)[\'"][^>]*>

Updated to have only one capture.

Fixed to include the dot. It would now only fail on unusual names like a.gif.jpg.

Also added safety matches as suggested in comment.
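
As an alternative sketch (a different pattern, not the one in this answer), a single negative lookahead at the start of the attribute value can reject only values that end in .gif, which lets names like a.gif.jpg through. The sample file names are made up for illustration.

```php
<?php
// Assumed variant: the lookahead (?![^'"]*\.gif['"]) fails the match only when
// the attribute value ends in ".gif" right before the closing quote.
$pattern = '/<img[^>]+src=[\'"](?![^\'"]*\.gif[\'"])([^\'"]+)[\'"][^>]*>/i';

$html = '<img src="anim.gif"> <img src="a.gif.jpg"> <img src="photo.png">';

preg_match_all($pattern, $html, $matches);
print_r($matches[1]); // contains a.gif.jpg and photo.png, but not anim.gif
```

This still treats a URL like anim.gif?v=2 as a non-GIF, since the value no longer ends in .gif. Filtering the captured values afterwards is easier to extend for such cases.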

stevemarvell