5

I am already parsing pages with the HtmlAgilityPack and getting most img sources. However, many websites include image URLs in places other than the img src attribute (e.g. inlined JavaScript, a different attribute, a different element). I would like to cast a slightly wider net and run a regex over the entire HTML string that captures anything meeting the following conditions.

  1. Must begin with http://, https://, //, or /
  2. Then, any number of valid url path characters
  3. Must end with .jpeg, .jpg, .png, or .gif

I imagine this would be simple to write; however, I am not an awesome regexer. I imagine the parts would look like this:

  1. ^((https?\:\/\/)|(\/{1,2}))
  2. (any ideas?)
  3. (.(jpe?g|png|gif))$

Can anyone help me fill the blanks?

Thanks

Answer

(https?:)?//?[^\'"<>]+?\.(jpg|jpeg|gif|png)
Adrian Adkison
  • Why don't you just use `.*?` in the middle? – alex May 30 '11 at 05:59
  • ^((https?\:\/\/)|(\/{1,2})).*?(.(jpe?g|png|gif))$ like this? I will give it a try – Adrian Adkison May 30 '11 at 06:03
  • Here is a real example of what I am trying to do: http://www.forever21.com/product.asp?catalog_name=FOREVER21&category_name=acc_handbags&product_id=1075808150&Page=all&pgcount=25&cookie_test=1 If you view the source of this link, "/images/thumbnail/75808150-01.jpg" is in the inlined JavaScript, and I want this to show up in my matches. – Adrian Adkison May 30 '11 at 06:20
  • possible duplicate of [Regex to check if valid URL that ends in .jpg, .png, or .gif](http://stackoverflow.com/questions/169625/regex-to-check-if-valid-url-that-ends-in-jpg-png-or-gif) – Muhammad Hasan Khan May 30 '11 at 06:29
  • Quite a few websites deliver image content as SVG. Others don't put suffixes, encoding the information directly in the response metadata. Your plan is incomplete (and probably impossible _to_ complete). – Donal Fellows May 30 '11 at 08:44
  • @Donal Fellows Just because a program does not work for 100% of websites does not make it incomplete. What if you are only trying to do this for retailers? How many do you know that use SVG? My point is that we are covering 90% of the websites people link to from our website using HtmlAgilityPack. In those cases we're getting images that are absolute paths, protocol relative, route relative, file relative, with query strings, and many more cases. I said I wanted to cast a slightly wider net, and with the help of @erisco I was able to capture good images on at least 20 more very commonly linked sites. – Adrian Adkison May 30 '11 at 21:59

2 Answers

8

There are a number of ad-hoc regular expressions for matching URLs out there, but none that I am aware of claim total reliability. However, this one will attempt to satisfy your conditions.

According to [1], the valid unreserved URL characters are alphanumerics and the symbols $-_.+!*'(),. There are reserved characters as well, +/?%#&, which [2] lists concisely -- I couldn't find a list in the bulk of the RFC. Other characters are used in query strings too, namely = and ;, so those need inclusion. On top of that, not everyone properly encodes their URL characters, so spaces and other stray characters may be present (I do not know how to account for these, since how a browser auto-corrects a URL can be mystifying).

Therefore, you might simply assume that anything can appear in a URL, as long as it starts with something particular and ends with something particular (which you provided). This is still unreliable.

@(https?:)?//?[^'"<>]+?\.(jpg|jpeg|gif|png)@
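
As a rough illustration of how this pattern behaves on a whole HTML string, here is a minimal Python sketch. The `@` characters above are PHP delimiters and are dropped here; the `find_image_urls` helper and the sample markup are made up for the example and are not part of the original answer.

```python
import re

# The pattern from the answer, without the PHP '@' delimiters.
# It is deliberately unanchored so it can match anywhere in the page source.
IMAGE_URL = re.compile(r"""(https?:)?//?[^'"<>]+?\.(jpg|jpeg|gif|png)""")


def find_image_urls(html):
    """Return every substring of html that the pattern matches, in order."""
    return [m.group(0) for m in IMAGE_URL.finditer(html)]


# Hypothetical markup: an img src, a path inside inline JavaScript,
# and a protocol-relative URL in a non-src attribute.
sample = (
    '<img src="http://example.com/a/logo.png">'
    "<script>var thumb = '/images/thumbnail/75808150-01.jpg';</script>"
    '<a data-full="//cdn.example.com/pic.gif">x</a>'
)

print(find_image_urls(sample))
```

Because quotes and angle brackets are excluded from the middle section, a match cannot run past the end of an attribute value or a quoted JavaScript string, which is what keeps closing tags such as `</script>` out of the results.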

erisco
  • this is pretty good, how would you change the ".+?" section to exclude single quote, double quote , <, and >. I realize they are possible, but not very likely to show up in a path and would really clean up the results. – Adrian Adkison May 30 '11 at 06:26
  • You would introduce a character class that blacklists characters, which is done by leading with the ^ character. I've adjusted my answer. – erisco May 30 '11 at 06:30
  • Wow! This works great! It only seems to be matching gifs though. http://www.forever21.com/product.asp?catalog_name=FOREVER21&category_name=acc_handbags&product_id=1075808150&Page=all&pgcount=25&cookie_test=1 I am testing it on this page's source. Don't ask why :) – Adrian Adkison May 30 '11 at 06:44
  • Well, that would make sense, as gif is the only type of image on the page -- or at least in the source code. – erisco May 30 '11 at 06:45
  • Another note, you will have to watch out for dynamic images. For example, `http://www.example.com/image.png?name=Adrian` might generate a unique image with the text "Adrian" on it. However, the regular expression will ignore the query string, giving you only `http://www.example.com/image.png`. Again, it is very difficult to be reliable with this. – erisco May 30 '11 at 06:48
  • I totally agree. I am willing to miss out on the query strings. Like I said, I am already getting every img src attribute with HtmlAgilityPack and that is working great. I just want to also be able to pick up some other img srcs that are in the inline JavaScript or in different places in the HTML. There are other images in the source code. Do a search for /images/thumbnail/75808150-01.jpg – Adrian Adkison May 30 '11 at 06:52
  • Sorry, must have typed jpg wrong when I searched for it. Well, I ran the regexp locally over the page's source and it matches the jpg's as well. I cannot reason why it wouldn't. Your call looks something like `preg_match_all('@(https?:)?//?[^\'"<>]+?\.(jpg|jpeg|gif|png)@', $contents, $matches);` yes? – erisco May 30 '11 at 07:00
  • Erisco, thanks a bunch for your help. It is working great. I am pulling the source server side and am getting a different file than the expected one hence no jpgs. – Adrian Adkison May 30 '11 at 07:15
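
The query-string caveat raised in the comments above can be seen directly. The sketch below compares the answer's pattern with a variation that also keeps a trailing query string; the variation is an assumption added for illustration, not something proposed in the thread, and the names `ORIGINAL` and `WITH_QUERY` are hypothetical.

```python
import re

# The pattern from the answer: matching stops at the image extension.
ORIGINAL = re.compile(r"""(https?:)?//?[^'"<>]+?\.(jpg|jpeg|gif|png)""")

# Assumed variation: optionally continue through a query string,
# stopping at quotes, angle brackets, or whitespace.
WITH_QUERY = re.compile(
    r"""(https?:)?//?[^'"<>]+?\.(jpg|jpeg|gif|png)(\?[^'"<>\s]*)?"""
)

html = '<img src="http://www.example.com/image.png?name=Adrian">'

original_hit = ORIGINAL.search(html).group(0)
query_hit = WITH_QUERY.search(html).group(0)

print(original_hit)  # the URL up to and including ".png"
print(query_hit)     # the same URL with "?name=Adrian" kept
```

Whether keeping the query string is desirable depends on the site; the asker stated they were willing to miss out on it.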
0
(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*\.(?:jpg|gif|png))(?:\?([^#]*))?(?:#(.*))?
Peyman
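
This second pattern resembles the URI-splitting expression from RFC 3986, Appendix B, with the path group required to end in an image extension. A minimal Python sketch of how its groups line up, assuming it is applied to one candidate URL at a time rather than to a whole page (the `split_image_url` helper and the group labels are hypothetical):

```python
import re

# The pattern from this answer, unchanged.
URL_PARTS = re.compile(
    r"(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*\.(?:jpg|gif|png))"
    r"(?:\?([^#]*))?(?:#(.*))?"
)


def split_image_url(url):
    """Return (scheme, authority, path, query, fragment), or None if no match.

    fullmatch is used so the whole candidate string must fit the pattern.
    """
    m = URL_PARTS.fullmatch(url)
    return m.groups() if m else None


parts = split_image_url("http://www.example.com/img/a.png?name=Adrian#top")
print(parts)

# The extension list in this pattern has no "jpeg" alternative.
print(split_image_url("/a/b.jpeg"))
```

Since the path group excludes only `?` and `#`, it can run across quotes and tags, so this pattern looks better suited to validating or splitting an already-extracted URL than to scanning raw HTML.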