Regex code exception gif

Question

I have the following code that returns the first image of the post:

$output = preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', 
               $post->post_content, $matches);

However, it returns any image. I need to ignore images in GIF format. How could I add this condition to the regex?

You can't parse [X]HTML with regex, because [Zalgo is coming](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)! — Oriol, Oct 09 '13 at 02:15
@Oriol There is a difference between "parsing html" and getting the content of an attribute. With regex the point of failure would be if an image tag was contained in a comment (could be acceptable), but on the other hand a proper parsing solution will fail if the HTML isn't valid, so the "proper solution" is also not perfect. You can't parse arbitrary HTML with regex, but for simple operations on a known format, regex is quite workable. PS: Voting down the first question of a new user who clearly explained his problem and tried to solve it??? — Sylverdrag, Oct 09 '13 at 03:27
@Sylverdrag Yes, but asker didn't say that the html string is trusted and has always the same format, so I only wanted to warn. And the downvote is not mine. — Oriol, Oct 09 '13 at 17:08
@Oriol The asker did not say that, but he is going through the result of $post->post_content. I think it's a safe assumption that the source HTML is the HTML content of his Wordpress page. For the downvote, sorry, I suspected it wasn't you, but I was too lazy to write that observation in a separate comment. — Sylverdrag, Oct 10 '13 at 05:07

score 1 · Answer 1 · answered Oct 09 '13 at 02:43

Easier to loop through the results and use a different regex.

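$output = preg_match_all('/<img[^>]+?src=[\'"](.+?)[\'"].*?>/i', $post->post_content, $matches);
$noGif = array();
// $matches[1] holds the captured src values; $matches[0] holds the full tags
foreach ($matches[1] as $imgSrc)
{
    // Keep the image unless its src ends in .gif (case-insensitive)
    if (!preg_match('/\.gif$/i', $imgSrc))
    {
        $noGif[] = $imgSrc;
    }
}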

It is easier to understand, and there won't be unexpected side effects like blocking valid pictures that happen to have the letters "gif" in the file name.

Note: be very careful when using .+ and .* because both are greedy and will happily run past the end of a tag. As it stands, your regex matches a LOT more than you think.

Try it on this, for instance:

<img whatever> whatever <img src="mypic.png"> <some other tag>
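
To see the difference, here is a small sketch that runs both patterns over the sample line above. The variable names ($html, $loose, $tight) are only illustrative; the patterns are the one from the question and the one from this answer.

```php
<?php
// Sample input from the answer: the first <img> has no src attribute.
$html = '<img whatever> whatever <img src="mypic.png"> <some other tag>';

// Pattern from the question: .+ and .* are greedy and may cross tag boundaries.
preg_match_all('/<img.+src=[\'"]([^\'"]+)[\'"].*>/i', $html, $loose);

// Pattern from this answer: [^>]+? cannot leave the tag, .*? stops at the first >.
preg_match_all('/<img[^>]+?src=[\'"](.+?)[\'"].*?>/i', $html, $tight);

var_dump($loose[0][0]); // the whole input line, from the first <img to the last >
var_dump($tight[0][0]); // only <img src="mypic.png">
```

Both patterns capture mypic.png here, but the full match of the first one swallows the unrelated tags around it, which matters as soon as a post contains more than one image.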

+1 for not trying to put everything in one regex. — Andy Lester, Oct 09 '13 at 03:59

score 1 · Answer 2 · answered Oct 09 '13 at 03:57

You should probably not be using regular expressions

HTML is not regular
Regexes may match today, but what about tomorrow?

Say you've got a file of HTML where you're trying to extract URLs from tags.

<img src="http://example.com/whatever.jpg">

So you write a regex like this (in Perl):

if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}

In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:

<img src='http://example.com/whatever.jpg'>

or

<img src=http://example.com/whatever.jpg>

or

<img border=0 src="http://example.com/whatever.jpg">

or

<img
    src="http://example.com/whatever.jpg">

or you start getting false positives from

<!-- <img src="http://example.com/outdated.png"> -->
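
One way to sidestep these cases in PHP is to let an HTML parser find the tags and only then look at the src value. The following is a minimal sketch, assuming the DOM extension (DOMDocument) is available; nonGifImageSources is a hypothetical helper name and does not come from the original posts.

```php
<?php
// Hypothetical helper: returns the src of every <img> that is not a GIF.
function nonGifImageSources(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src === '') {
            continue; // <img> without a usable src
        }
        // Look at the extension of the URL path only, ignoring any query string.
        $path = (string) parse_url($src, PHP_URL_PATH);
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'gif') {
            $sources[] = $src;
        }
    }
    return $sources;
}

$html = "<img src='a.jpg'><img border=0 src=\"b.GIF\"><img\n    src=\"c.png\">"
      . '<!-- <img src="old.png"> -->';
$first = nonGifImageSources($html)[0] ?? null; // first non-GIF image, if any
```

Quote style, attribute order and line breaks no longer matter here, and the commented-out tag is skipped because the parser treats it as a comment rather than an element.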

score 0 · Answer 3 · answered Oct 09 '13 at 02:18

<img[^>]+src=[\'"](?:([^\'"](?!\.gif))+)[\'"][^>]*>

Updated to have only one capture.

Fixed to include dot. Now would only fail on strange things like a.gif.jpg

Also added safety matches as suggested in comment.
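
A quick check of how this pattern behaves may help. One thing to watch: because the capture group sits inside the repeated group, group 1 ends up holding only the last character matched. The sketch below compares the pattern as posted with a hypothetical adjustment ($adjusted is an assumption, not the author's code) that wraps the whole repetition in the capture and only rejects .gif directly before the closing quote.

```php
<?php
// Pattern as posted in this answer.
$asPosted = '/<img[^>]+src=[\'"](?:([^\'"](?!\.gif))+)[\'"][^>]*>/i';

// Hypothetical adjustment (an assumption, not from the original answer):
// capture the whole value and reject ".gif" only when it ends the value.
$adjusted = '/<img[^>]+src=[\'"]((?:(?!\.gif[\'"])[^\'"])+)[\'"][^>]*>/i';

preg_match($asPosted, '<img src="pic.png">', $posted);
var_dump($posted[1]); // "g": a repeated capture group keeps only its last iteration

$postedOnDouble = preg_match($asPosted, '<img src="a.gif.jpg">');
var_dump($postedOnDouble); // 0: the a.gif.jpg case mentioned above is rejected

preg_match($adjusted, '<img src="a.gif.jpg">', $fixed);
var_dump($fixed[1]); // "a.gif.jpg"

$adjustedOnGif = preg_match($adjusted, '<img src="a.gif">');
var_dump($adjustedOnGif); // 0: GIFs are still skipped
```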

edited Oct 09 '13 at 02:35

answered Oct 09 '13 at 02:18

stevemarvell


.+ between img and src can have unexpected results, as can .*: <img[^>]+?src=[\'"](?:([^\'">](?!gif))+)[\'"].*?> — Sylverdrag, Oct 09 '13 at 02:26
I've updated my code appropriately — stevemarvell, Oct 09 '13 at 02:35
