Inaccurate preg_match with '.jpg' pattern

Question

I am using preg_match with the pattern $pattern = '/src="http:\/\/(.*?).jpg"/s'; to grab the URLs of JPEG images from a web page. However, this is not accurate enough, as it also grabs http://www.domain.com/image.png"> Yadayada <img src="anotherpic.jpg.

Other times, it grabs stuff like

http://maps.google.com/maps/api/staticmap?center=42.34,-71.18&path=weight:4|42.338,-71.177|42.338,-71.183|42.342,-71.183|42.342,-71.177|42.338,-71.177&zoom=15&size=335x225&sensor=false" width="280" height="188" alt=""></td></tr> <tr><td height="10"></td></tr></table></td></tr></table></td></tr><tr><td height="10 valign="> </td></tr><tr><td valign="top" background="http://www.coolapartments.info/img/java-footer_bg.jpg

How can I improve the pattern to prevent unwanted matching like the 2 examples above?

Possibly worth mentioning this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — erisco, Oct 19 '11 at 05:49
@erisco its not worth mentioning because the funny answer is wrong. — Gordon, Oct 19 '11 at 05:54
Odd criteria for discarding the entire discussion that took place. — erisco, Oct 19 '11 at 05:57
@erisco no one reads the discussion that took place. that answer is nowadays only used to slap down all the parse html with regex questions. we have better canonicals for that purpose. — Gordon, Oct 19 '11 at 06:07
Oh, I get it. Look here then: http://stackoverflow.com/questions/3577641/best-methods-to-parse-html-with-php/3577662#3577662 — erisco, Oct 19 '11 at 06:11
I don't understand why this question gets downvotes because it is not wrong to use regexes to parse HTML. Why ? Because HTML is not XML (bye bye XPath) and sometimes you don't want to traverse a super complex deep tree just to find images. You don't know nothing about its structure and regexes allow you to ignore structure to focus on the lexical stuff. Sometimes it is just the right tool. — Ludovic Kuty, Oct 19 '11 at 06:11
@LudovicKuty because it doesnt show research effort - on a sidenote: while regex can be used for matching strings in HTML documents, I think they are not the right tool here, because they dont distinguish between attributes, elements and text and the OP apparently wants to match on attributes only. It's also incorrect that you cannot use XPath here because DOM can parse broken HTML and apply XPath queries to that. — Gordon, Oct 19 '11 at 06:15
@Gordon its not that I dont make any attempts at searching, its just that I don't know the correct keywords/phrases to use while searching especially since I am confused myself. Ludovic's answer works perfectly for my needs! Thanks! — Nyxynyx, Oct 19 '11 at 06:21
@Nyx oh, I believe you attempted searching. But you probably skipped the *effort* part ;) see, you dont need any special keywords: [match+all+urls+from+html+php](http://stackoverflow.com/search?q=match+all+urls+from+html+php) has lots of helpful posts already. And going through a few of those will likely make you discover better keywords or even allow you to figure it out on your own. — Gordon, Oct 19 '11 at 06:28
@Gordon You are right that a DOM tree can be build from an HTML page and then searched with XPath. I forgot that. XPath is then particularly handy. — Ludovic Kuty, Oct 19 '11 at 08:12

score 3 · Accepted Answer · answered Oct 19 '11 at 05:42

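Replace (.*?).jpg with ([^"]*)\.jpg so that the match cannot cross the closing double quote of the src attribute. Escaping the dot also makes it match a literal "." instead of any character. The pattern could be even more generic with src="([^"]*)\.jpg", which no longer requires the URL to start with http://.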
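
As a rough illustration of the suggestion above, here is a minimal sketch using preg_match_all. The sample markup in $html is made up for this example (loosely modeled on the failing cases in the question), and the capture group is widened to include the http:// prefix and the .jpg suffix so that the full URL is captured; treat that as one possible variation rather than the only way to write it.

```php
<?php
// Hypothetical sample markup, loosely modeled on the failing cases in the question.
$html = '<img src="http://www.domain.com/image.png"> Yadayada '
      . '<img src="http://www.domain.com/anotherpic.jpg"> '
      . '<td background="http://www.example.com/footer_bg.jpg">';

// [^"]* cannot run past the closing quote of the attribute,
// and \. matches a literal dot instead of any character.
$pattern = '/src="(http:\/\/[^"]*\.jpg)"/';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the captured URLs.
print_r($matches[1]);
```

With this input, only the second src value should be captured: the .png URL is rejected because the match can no longer continue past its closing quote, and the background attribute is skipped because it is not a src attribute.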

Ludovic Kuty

You could also restrict the character class so that it excludes single quotes and angle brackets. – tripleee Oct 19 '11 at 06:04

score 2 · Answer 2 · edited May 23 '17 at 11:55

Use DOM and this XPath

//@src[contains(., '.jpg')]

to match all src attributes whose value contains the string ".jpg" anywhere.

If the attribute should end in ".jpg" use

//@src[substring(., string-length(.) - 3) = '.jpg']

which is equivalent to the XPath 2.0 function ends-with().
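
A minimal sketch of how the "ends in .jpg" query might be run from PHP with DOMDocument and DOMXPath. The markup in $html is invented for illustration. XPath string positions are 1-based, so the last four characters of a value start at string-length(.) - 3.

```php
<?php
// Hypothetical sample markup; a real page could be loaded with loadHTMLFile() instead.
$html = '<html><body>'
      . '<img src="http://www.domain.com/image.png">'
      . '<img src="http://www.domain.com/anotherpic.jpg">'
      . '<table><tr><td background="http://www.example.com/footer_bg.jpg"></td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select src attributes whose value ends in ".jpg" (XPath 1.0 has no ends-with()).
$nodes = $xpath->query("//@src[substring(., string-length(.) - 3) = '.jpg']");

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```

Because the query only looks at src attributes, the background attribute is ignored even though its value also ends in .jpg.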

The main benefit of using DOM and XPath is that they operate only on src attributes, while your regex matches anywhere in the document. There are plenty of usage examples for DOM and XPath here:

https://stackoverflow.com/search?q=xpath+OR+dom+php
