RegEx help: need to grab images (Classic ASP)

Question

I'm using the following expression in classic asp that successfully grabs any image tag with a .jpg and .png suffix.

re.Pattern = " ]*src=[""'][^ >]*(jpg|png)[""']"

The problem that I've found is that many sites I need to use do not actually use a suffix. So, I need a new regex that finds an image tag and grabs whatever is in the src attribute.

As simple as this sounds, finding a regular expression to accomplish this in Classic ASP seems impossible without writing it myself (which IS impossible).

Please advise.

Parsing HTML with Regex is a fool's task. The rest of us use DOM methods. See: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Diodeus - James MacFarlane, May 08 '14 at 21:25
@Diodeus Agreed, RegEx doesn't work well for *parsing* HTML, but in this case they are reading the source and just looking for matches, not parsing nested HTML structures, so I think RegEx is an acceptable method. – user692942, May 09 '14 at 09:03

signus · Accepted Answer · 2014-05-09T00:27:15.993

3

To match plainly on the img src you can do:

\<img src\=\"(\w+\.(gif|jpg|png)\")

And then if you only want the value that's in the img src, you can match any filename ending in a picture extension (but this may get you false positives, since nothing ties the match to an img tag):

\w+\.(gif|jpg|png)

But to match just the value while tying it to img src, you can use a negative lookahead, which rejects any starting position that still has `<img src="` ahead of it on the same line (note that I added a matching group there):

(?!.*\<img src\=\")(\w+\.(gif|jpg|png))
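
As a rough sketch of another way to tie the match to the tag, a capture group can sit directly after src= inside the img tag, which also picks up values that have no suffix at all. The example is in Python purely for illustration; the pattern, the function name extract_src, and the sample markup are assumptions and not part of the answer above. The pattern uses no lookbehind, so it should carry over to the VBScript RegExp object, where the captured text would be read from SubMatches.

```python
import re

# Assumed pattern: find an <img ...> tag, skip any attributes that come
# before src, and capture whatever sits between the quotes of src.
IMG_SRC = re.compile(r"""<img\b[^>]*\ssrc\s*=\s*["']([^"']*)["']""", re.IGNORECASE)


def extract_src(html):
    """Return the src value of every img tag found in html, in order."""
    return IMG_SRC.findall(html)


# Hypothetical markup, including src values without a file suffix.
sample = '''
<a href="myfile.htm"><img src="picture.gif"></a>
<img id="test1" class="answer1" src="text.jpg">
<img src="http://www.yahoo.com/images/imagename">
<img src='/images/imagename'>
'''

for src in extract_src(sample):
    print(src)
```

Because the capture is anchored to the tag itself, the href values and any other quoted strings on the page are ignored. Unquoted src values are not handled by this sketch.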

Now to include the possibility of having image links in your image source:

(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)?[\?\w+\%]+)

And then let's remove the false positives we get by taking the optional quantifier (`?`) off (gif|jpg|png), moving it onto the next set (which matches data you may get in a JS link, etc.), and making sure we have an end quote:

(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)([\?\w+\%]+)?)(?=\")
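
To see what this final pattern does and does not pick up, here is a small sketch that runs it line by line in Python (used only for illustration; the helper name first_image and the last test line are assumptions). It also shows that a src with no gif, jpg, or png suffix is not matched.

```python
import re

# The final pattern from above, written as a Python raw string.
FINAL = re.compile(
    r'(?!.*\<img src\=\")([\/\.\-\:\w]+\.(gif|jpg|png)([\?\w+\%]+)?)(?=\")'
)


def first_image(line):
    """Return group 1 of the first match on a line, or None if nothing matches."""
    match = FINAL.search(line)
    return match.group(1) if match else None


lines = [
    '<a href="myfile.htm"><img src="picture.gif"></a>',
    '<img id="test1" class="answer1" src="text.jpg">',
    '<img src="yahoo.com/images/imagename.gif">',
    # Hypothetical line with no image suffix in src.
    '<img src="http://www.yahoo.com/images/imagename">',
]

for line in lines:
    print(first_image(line))
```

The first three lines give picture.gif, text.jpg, and yahoo.com/images/imagename.gif; the last one gives None because the pattern requires one of the three extensions.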

Note: This will match this data, but regular expressions don't parse HTML, and I personally don't recommend using regular expressions to look through HTML data unless you're doing it on a case-by-case basis. If you want to do some URL/image scraping via a script, look into an XML/HTML parser.
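
For the parser route mentioned in the note, a minimal sketch using Python's standard html.parser module might look like the following. The class name ImgSrcCollector and the sample markup are made up for illustration; the point is only that a parser hands over each img tag's src value directly, whether or not it has a suffix.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value None for bare attributes.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical markup for the demonstration.
page = '''
<a href="myfile.htm"><img src="picture.gif"></a>
<img id="test1" class="answer1" src="text.jpg">
<IMG SRC='/images/imagename'>
'''

collector = ImgSrcCollector()
collector.feed(page)
print(collector.sources)
```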

Sample data:

<a href="myfile.htm"><img src="picture.gif"></a>
<a href="index.htm"><img src="pic859.jpg"></a>
<a href="page-57.htm"><img src="859.png"></a>
<img id="test1" class="answer1" src="text.jpg">
<img src="http://media.site.com/media/img/staff/2013/ROTHBARD-350_s90x126.jpg?e3e29f4a7131cd3bc7c4bf334be801215db5e3c2%22%3E">
<img src="yahoo.com/images/imagename.gif">

HTML Source

edited May 09 '14 at 00:27

answered May 08 '14 at 21:29

signus


This is helpful, very much. The problem is that too many times I've found that – Astralis Lux May 08 '14 at 21:40
This is a very useful piece of information to have. Luckily, because I have a negative lookahead looking for `src=`, even though it is `img id`, it still matches the sample you provided. Feel free to provide other samples/examples if they do not match this criteria. – signus May 08 '14 at 22:29
I will test, but will it find the value of src with this example: Notice there is no suffix. – Astralis Lux May 08 '14 at 22:35
Yes I mentioned that I tested that, and I added it in my answer under "Sample data:". – signus May 08 '14 at 22:45
What about this? I tested it and it doesn't grab it. – Astralis Lux May 08 '14 at 23:04
Well in this case I will need to update to include links, as I did not consider that possibility before. Updated my answer. – signus May 08 '14 at 23:07
Another thing, I should have been clear so I apologize. I need the full path in the src. It appears that this is stripping the full path from src. So will give http://www.yahoo.com/images/imagename or will give /images/imagename – Astralis Lux May 08 '14 at 23:08
It wasn't that it was stripping it, it was that I didn't consider that possibility. I updated the answer to include this functionality. And as far as your example ` ` is concerned, there is no "gif", "png" or "jpg". Check my sample data above where I included that, but with .gif. – signus May 08 '14 at 23:11
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/52341/discussion-between-signus-and-astralis) – signus May 08 '14 at 23:13
Thank you! OP, feel free to accept my answer if this answered all of your questions. :) – signus May 12 '14 at 17:57
