0

I have a specific case (prefer not to use Cheerio or jsdom) and need a flexible regular expression that will find all relative paths for images—even those with invalid markup. I have two issues which can be seen at http://www.regexr.com/3bkil.

  • It should not capture the opening single or double quote (" or ').
  • It should not match tags that are missing the less-than symbol (<).

Here is what I have so far...

(?!(\<\s*img [^\>]*src\s*=\s*))[\"\']\s*[\w\.\-\/]+(\.(png|jpg|jpeg|gif))(?=(.(\"|\')|(\"|\')))

Almost there. Here are the test cases.

TO MATCH:
<img src="images/vendor.png" alt="" > <img src="images/vendor.gif" class="box-bg-image" alt="" >
<img src="images/vendor-dp-20141009-flatware.jpg" class="box-bg-image" alt="" >
<img src="images/vendor-flatware.jpeg" class="box" alt="" >
<img src='images/vendor-flatware.jpeg' class="box" >
<img alt="" src= 'images/vendor-flatware.jpeg' alt="" >
<img src=' images/vendor-flatware.jpg' alt="" >
<img src=' images/vendor-flatware.gif' alt="" >
<img src=' images/vendor-flatware.png ' alt="" >
<img src='../silverware.png' alt="" >
<img class="box" src='images/vendor-watch.png' alt="" >
<img src=" images/vendor-flatware.jpeg " alt="" >
< img  src="images/vendor-flatware.jpeg " alt="" >
< img  src="images/vendor-flatware.jpeg " alt="" >
<img src="vendor.gif" alt="">


NOT TO MATCH:
<img src="http://thirdpartycdn.com/image.jpg">
<img src='http://thirdpartycdn.com/image.png'>
<img src="http://thirdpartycdn.com/image.gif" class="box-bg-image" alt="">
img src="images/vendor-flatware.jpeg "
<img src="images/vendorpng" alt="" >

Any help would be appreciated!

Aaron
  • You should parse the string into HTML, then process the `src` attribute. –  Aug 21 '15 at 18:00
  • @torazaburo In the browser, yes. But I'm in Node.js and don't have a window or document object to work with. – Aaron Aug 21 '15 at 18:20

4 Answers

1

Since JavaScript doesn't have lookbehinds (true when this was written; they were added in ES2018), I would go with this:

\<\s*img[^>]*src\s*=\s*["']([^"':]+?\.(png|jpg|jpeg|gif))

and use the content of the first capture group.
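
A minimal sketch of reading the first capture group in Node.js, assuming the pattern above is used with the `g` flag. The helper name `relativeImagePaths` and the sample markup are made up for illustration, and the `trim()` call is an addition: the capture keeps any whitespace that follows the opening quote.

```javascript
// Pattern from the answer above, with the g flag so exec() can walk every tag.
const imgSrc = /<\s*img[^>]*src\s*=\s*["']([^"':]+?\.(png|jpg|jpeg|gif))/g;

// Hypothetical helper: returns the relative image paths found in an HTML string.
function relativeImagePaths(html) {
  const paths = [];
  imgSrc.lastIndex = 0; // reset state left over from a previous call
  let match;
  while ((match = imgSrc.exec(html)) !== null) {
    paths.push(match[1].trim()); // group 1 is the path without the quote
  }
  return paths;
}

// Sample input: two tags that should match, two lines that should not.
const sample = [
  '<img src="images/vendor.png" alt="" >',
  "<img src=' images/vendor-flatware.jpg' alt=\"\" >",
  '<img src="http://thirdpartycdn.com/image.jpg">',
  'img src="images/vendor-flatware.jpeg "',
].join("\n");

console.log(relativeImagePaths(sample));
```

The absolute URL is skipped because `[^"':]` cannot cross the `:` in `http:`, and the last line is skipped because it has no `<`.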

Your negative look-ahead (?!(\<\s*img [^\>]*src\s*=\s*)) is useless here. Remove it and you will get the same result: it only asserts that "<img ..." does not start at the current position, and since the next character there has to be a quote, that assertion is always true.
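
A quick way to check this claim is to run the original expression with and without the look-ahead over the same input and compare the matches. This is only a sketch; the variable names and the two-line sample are assumptions, not taken from the question's regexr page.

```javascript
// The question's expression as posted, plus the g flag.
const withLookahead =
  /(?!(\<\s*img [^\>]*src\s*=\s*))[\"\']\s*[\w\.\-\/]+(\.(png|jpg|jpeg|gif))(?=(.(\"|\')|(\"|\')))/g;

// The same expression with the leading negative look-ahead removed.
const withoutLookahead =
  /[\"\']\s*[\w\.\-\/]+(\.(png|jpg|jpeg|gif))(?=(.(\"|\')|(\"|\')))/g;

// Assumed sample: one well-formed tag and one line missing the "<".
const markup = [
  '<img src="images/vendor.png" alt="" >',
  'img src="images/vendor-flatware.jpeg "',
].join("\n");

const matchesWith = markup.match(withLookahead);
const matchesWithout = markup.match(withoutLookahead);

// Both expressions return the same list of matches.
console.log(JSON.stringify(matchesWith) === JSON.stringify(matchesWithout)); // prints true
```

Both versions still include the opening quote and still match the line without `<`, which are the two problems described in the question.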

I removed the final check for ["'] because, since your extensions are well defined, there isn't much point in it.

Sylverdrag
  • No valid reason. I worked over OP's expression and forgot to remove the escape character. – Sylverdrag Aug 21 '15 at 18:02
  • Short and easy to read, I like this one the best. Final results can be found at [jsbin](http://jsbin.com/lemezinoqe/1/edit?js,output) and [regex101](https://regex101.com/r/oM7tH8/2). I gave up on my search for an overarching regular expression after reading this [answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454). – Aaron Aug 25 '15 at 16:57
  • Good news is, I don't need to use Cheerio. – Aaron Aug 25 '15 at 17:00
0

You can use this:

(?:'|")([^"':]*?\.(?:png|jpe?g|gif)[^'"]*(?=[^<]+?>))

Regex live here.

But.. why not a DOM Parser?

0

You could try this.

((?:<\s*img[^>]+?src=\s*["']))(?!https?:\/\/)([^'"]*?\..*?)(?=["'].*?>)

Regex101 JSBin

d0nut
  • @WashingtonGuedes Thanks, i didn't really see it. – d0nut Aug 21 '15 at 17:54
  • Thanks for pointing out regex101.com! I think I'm going to use that tool from now on. I ran your regex through jsbin (http://jsbin.com/nosofalace/1/edit?js,output) and it removed ' – Aaron Aug 21 '15 at 20:48
  • @Aaron check my updated answer for jsbin and updated regex to help you. – d0nut Aug 21 '15 at 21:14
0

This one works:

(?:<\s*img[^>]+src\s*=['"\s]+)((?:[\w\d-\/.]+|[\w\d-]+)\.\w+)

The image path is captured in group 1, so you can read it via $1.

Strategy

My strategy is to split the target pattern into two types:

  1. Image paths that contain /, e.g. images/vendor-flatware.png,
  2. Image paths without /, e.g. vendor-flatware.png.

Regex Explanation

(?:<\s*img[^>]+src\s*=['"\s]+): Finds the start of the image tag up to the opening of the file path, which is matched loosely by ['"\s]+, meaning any combination of ', " and whitespace,

[\w\d-\/.]+: Matches the first type of file path (contains /),

[\w\d-]+: Matches the second type of file path (does not contain /),

\.\w+: Matches the file extension. Use \.(?:jpg|jpeg|png|gif) instead to restrict it to the image types.
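
The breakdown above can be sketched by assembling the expression from its parts. Note that the regex as posted ends in `\.\w+`, while the explanation lists `\.(?:jpg|jpeg|png|gif)`; this sketch assumes the restricted list. The names `tagPrefix`, `pathBody`, `extension` and the sample markup are illustrative only.

```javascript
// Each piece mirrors one part of the explanation above.
const tagPrefix = /(?:<\s*img[^>]+src\s*=['"\s]+)/.source; // tag start up to the opening quote
const pathBody = /(?:[\w\d-\/.]+|[\w\d-]+)/.source;        // path with or without "/"
const extension = /\.(?:jpg|jpeg|png|gif)/.source;         // restricted extension list (assumption)

// Only the path plus extension is wrapped in a capturing group.
const imgPath = new RegExp(tagPrefix + "(" + pathBody + extension + ")", "g");

const html =
  '<img src="images/vendor.png"> <img src="vendor.gif" alt=""> ' +
  '<img src="http://thirdpartycdn.com/image.jpg">';

// Collect group 1 of every match.
const found = Array.from(html.matchAll(imgPath), (m) => m[1]);
console.log(found);
```

The absolute URL is not matched because `:` is not in either path character class, so the extension can never be reached from the opening quote.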

Additional

In case that you want to use it with replace function try this

(<\s*img[^>]+src\s*=['"\s]+)([\w\d-\/.]+\.\w+|[\w\d-]+\.\w+)

Where first (...) is captured to $1 and the second (...) is captured to $2.

If you test this regex on http://jsbin.com/xebedunoki/edit?js,output, you can try the replace function like this:

var newstr = strVar.replace(reg, "$1XXX");

Here, you will see that every path is replaced by XXX.
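
To make the roles of `$1` and `$2` concrete, here is a small sketch. The variable names `reg` and `strVar` follow the snippet above, but their values here are assumed, and the `assets/` directory is purely hypothetical.

```javascript
// Two-group version of the regex from this answer, with the g flag.
const reg = /(<\s*img[^>]+src\s*=['"\s]+)([\w\d-\/.]+\.\w+|[\w\d-]+\.\w+)/g;

// Assumed input; the real jsbin uses a longer string.
const strVar = "<img src='images/vendor-watch.png' alt=\"\" >";

// Keep the tag prefix ($1) and swap the path ($2) for a placeholder.
const masked = strVar.replace(reg, "$1XXX");
console.log(masked); // <img src='XXX' alt="" >

// Keep both groups and put a hypothetical directory in front of the path.
const prefixed = strVar.replace(reg, "$1assets/$2");
console.log(prefixed); // <img src='assets/images/vendor-watch.png' alt="" >

// Leaving out $1 drops the tag prefix, because it is part of the match.
const broken = strVar.replace(reg, "XXX");
console.log(broken); // XXX' alt="" >
```

Because the whole match (prefix and path) is what gets replaced, `$1` has to be written back into the replacement string whenever the tag itself should survive.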

fronthem
  • I have to test this further but I think I was able to accomplish the task with your regex: http://jsbin.com/jozarufela/1/edit?js,output. But, I'm very curious what the $1 does? – Aaron Aug 21 '15 at 21:33
  • `$1` stands for the leading text such as `< img ...="`. When you replace, you have to concatenate it back in; otherwise you will see that every `< img ...="` is removed. – fronthem Aug 21 '15 at 21:37
  • In my regex, the first `(...)` is captured to `$1`, the second `(...)` is captured to `$2`. – fronthem Aug 21 '15 at 21:41