0

Here's an example HTML string.

PS: Notice that an image tag can carry any arbitrary attributes, and that some images close with "/>" and others with ">". It shouldn't matter. The regex should filter out all the noise and capture every image's src value in an array.

The answers already given on Stack Overflow don't account for extra spaces inside the image tag or for other attributes in between.

<div>
  <div>
    <div>
      <img   title=  "SOME TITLE" src="SOME IMAGE" alt="SOME ALT" />
      <img   alt="SOME ALT" title="SOME TITLE" src=   "SOME IMAGE"     >
    </div>
    <img src="SOME IMAGE">
  </div>

  <div>
    <img alt   ="SOME ALT" src=  "SOME IMAGE" title="SOME TITLE">
  </div>

  <img   src="SOME IMAGE" alt="SOME ALT" title="SOME TITLE" />
  < img src  ="SOME IMAGE" alt="SOME ALT" title="SOME TITLE"    />
</div>

I'm looking for code like this:

var pictures = [],
  m,
  rx = /SOME REGEX/g;

while (m = rx.exec(str)) { //str being the html string of any sort
  pictures.push(m[SOME INDEX]); //m[SOME INDEX] to match the value of src attribute
}
abelabbesnabi
  • Why do you need a regex? Why not use the HTML parsing functionality built-in to basically every JS engine, and let it extract the images? HTML is not a regular language; parsing it with regexes [is a terrible idea](http://stackoverflow.com/a/1732454/364696). – ShadowRanger May 04 '17 at 02:26
  • No, it's not. The answers given there don't solve the issue. Here's a fiddle: https://jsfiddle.net/mL3d6h7q/1/ – abelabbesnabi May 04 '17 at 02:43
  • @phatfingers, I need your help please. I asked specifically for regex because I can't use anything people are proposing. I can only use regex and consider the html a string and not a document. The answer is simply like this: var pictures = [], m, rx = /SOME REGEX/g; while (m = rx.exec(str)) { //str being the html string of any sort pictures.push(m[SOME INDEX]); //m[SOME INDEX] to match the value of src attribute } – abelabbesnabi May 04 '17 at 12:26
  • @abelabbesnabi Essentially, it is the same question. You should be looking at how to adjust the regex for whitespace. – Daniel A. White May 05 '17 at 20:12

4 Answers

2

This is probably what you need, although I don't understand why you have to use a regex. Let's focus on this example; it still needs more validation to be robust.

The basic idea is to give the container div an id (you could use the body tag instead, but I recommend being more granular). Pick the element that contains all the img tags, capture its innerHTML, and apply the regex to that string. I also recommend using querySelectorAll, which is simpler.

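var pictures = [],
  m;
// Read the container's markup as a plain string, then scan it with the regex
var str = document.getElementById('container').innerHTML,
    rex = /<img[^>]+src="?([^"\s]+)"?\s*/gi;

// With the g flag, exec() advances through the string one match at a time
while ((m = rex.exec(str)) !== null) {
    pictures.push(m[1]); // group 1 holds the src value
}

// List each extracted URL inside the #output element
var output = document.getElementById('output');
var index = 0;
pictures.forEach(function (picture) {
  var pTag = document.createElement('p');
  pTag.innerHTML = '[' + index++ + '] ' + 'img tag found. URL extracted -> ' + picture;
  output.appendChild(pTag);
});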
<div id="container">
  <div>
    <div>
      <img title="SOME TITLE" src="http://i.imgur.com/1B0mUM2.jpg" alt="SOME ALT" />
      <img alt="SOME ALT" title="SOME TITLE" src="http://i.imgur.com/UWWQ0Wr.jpg">
    </div>
    <img src="http://i.imgur.com/UWWQ0Wr.jpg">
  </div>

  <div>
    <img alt="SOME ALT" src="http://i.imgur.com/UWWQ0Wr.jpg" title="SOME TITLE">
  </div>

  <img src="http://i.imgur.com/1B0mUM2.jpg" alt="SOME ALT" title="SOME TITLE" />
  <img src="http://i.imgur.com/UWWQ0Wr.jpg" alt="SOME ALT" title="SOME TITLE" />
</div>
<div id="output"></div>
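The pattern above expects `src=` to be written without spaces around the equals sign, while the question's sample also contains `src=   "..."`, `src  ="..."`, and `< img`. A minimal sketch of a more tolerant variant follows. It assumes every src value is double-quoted, and the helper name `extractImageSources` and the sample file names are made up for illustration. Like any regex over HTML, it can still be fooled, for example by a `>` inside an earlier attribute value.

```javascript
// Hypothetical helper, not part of the original answer.
function extractImageSources(html) {
  // <\s*img\b   : "<", optional spaces, then the tag name "img"
  // [^>]*?\ssrc : any other attributes (matched lazily), whitespace, then "src"
  // \s*=\s*     : "=" with optional whitespace on either side
  // "([^"]*)"   : the double-quoted value, captured as group 1 (spaces allowed)
  const rx = /<\s*img\b[^>]*?\ssrc\s*=\s*"([^"]*)"/gi;
  const pictures = [];
  let m;
  while ((m = rx.exec(html)) !== null) {
    pictures.push(m[1]);
  }
  return pictures;
}

// Sample modelled on the question's markup; file names are placeholders.
const sample = [
  '<img   title=  "SOME TITLE" src="a.jpg" alt="SOME ALT" />',
  '<img   alt="SOME ALT" title="SOME TITLE" src=   "b 2.png"     >',
  '<img alt   ="SOME ALT" src=  "c.gif" title="SOME TITLE">',
  '< img src  ="d.jpg" alt="SOME ALT" title="SOME TITLE"    />'
].join('\n');

console.log(extractImageSources(sample));
```

Requiring whitespace before `src` (`\ssrc`) rather than a word boundary keeps attributes such as `data-src` from being picked up by mistake.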
Teocci
0

I think I have a pattern for you. Covers http/https/ftp/ftps or just //.

(http|ftp|\/{2})?s?:?\/{2}(.*[^\s]+)\.(jpe?g|png|gif)\s
Sterling Beason
  • It shouldn't matter what's inside the src; all I need is to extract it. I suggest you run my code in https://jsfiddle.net/ with your expression and the index for the match. – abelabbesnabi May 04 '17 at 02:28
0

I'm doing this:

var
  uri = response.request.uri, //Coming from node
  pictures = [],
  r = /src="?([^"\s]+)(jp?g|png|gif)"/g,
  m;

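while ((m = r.exec(html)) !== null) {
  // Skip inline data URIs
  if (!m[1].startsWith('data:')) {
    // Prefix relative paths with the page's protocol and host
    if (!m[1].startsWith('http')) {
      m[1] = uri.protocol + '//' + uri.host + '/' + m[1];
    }

    // m[1] is everything before the extension (dot included), m[2] is the extension
    pictures.push({ src: m[1] + m[2] });
  }
}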
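As an alternative to concatenating protocol and host by hand, the standard `URL` constructor (a global in Node and in browsers) can resolve a relative src against the page address. The sketch below is only an illustration under a few assumptions: `resolveImageSources` and `pageUrl` are hypothetical names, the pattern is a simplified one that drops the extension filter, and src values are assumed to be double-quoted. Note that a path such as `b.png` resolves against the page's directory, not the host root, which differs from the loop above.

```javascript
// Hypothetical helper; pageUrl stands in for the address the HTML was fetched from.
function resolveImageSources(html, pageUrl) {
  // Simplified pattern (an assumption): any double-quoted src, whitespace allowed around "="
  const rx = /src\s*=\s*"([^"]+)"/g;
  const pictures = [];
  let m;
  while ((m = rx.exec(html)) !== null) {
    const src = m[1].trim();
    if (src.startsWith('data:')) continue; // skip inline data URIs
    // new URL(relative, base) leaves absolute URLs untouched and resolves relative ones
    pictures.push({ src: new URL(src, pageUrl).href });
  }
  return pictures;
}

// Placeholder markup and page address for illustration only.
const html =
  '<img src="/img/a.jpg"><img src="b.png">' +
  '<img src="https://cdn.example.com/c.gif">' +
  '<img src="data:image/png;base64,AAAA">';

console.log(resolveImageSources(html, 'https://example.com/gallery/index.html'));
```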
abelabbesnabi
  • See my answer. Your regex contains the capturing group (jp?g), which matches jpg and jg but not jpeg. The src file could also have many other extensions. We can try a more general regex: /src(\s*)=(\s*)"([^\s]*)"/g – Sandeep Sharma May 05 '17 at 20:10
0

Try the following:

  /**
   * 1. src      : the match starts at "src"
   * 2. (\s*)    : optionally followed by zero or more whitespace characters
   * 3. =        : then a mandatory "="
   * 4. (\s*)    : optionally followed by zero or more whitespace characters
   * 5. "        : then the opening "
   * 6. ([^\s]*) : zero or more non-whitespace characters (the src value, group 3)
   * 7. "        : finally the closing "
   */
var re = /src(\s*)=(\s*)"([^\s]*)"/g;

var str = "src=\"http://bsfsd1.png\" xyz  a src= \"http://bsfsd2.xyz\" axy src=   \"http://bsfsd3.png\" abc src   =  \"http://bsfsd4.png\" sandeep ";

var xArray; 
var pictures = [];
while(xArray = re.exec(str)){
  pictures.push(xArray[3]);
}
console.log(pictures);
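One limitation of `[^\s]*` is that it cannot match a value containing a space (such as the question's "SOME IMAGE"), and because it is greedy it can run past the closing quote when the next attribute follows without a space. A possible variant, offered as a sketch and not as the answer's own method, captures everything up to the next double quote and uses `String.prototype.matchAll` in place of the `exec` loop. The sample string is invented for illustration and double-quoted values are assumed.

```javascript
// Variant pattern (an assumption): the value is anything up to the next double quote.
const re = /src\s*=\s*"([^"]*)"/g;

// Invented sample: one value with a space, one followed directly by another attribute.
const str =
  'src="http://bsfsd1.png" xyz  a src= "SOME IMAGE" axy ' +
  'src   =  "http://bsfsd4.png"alt="x"';

// matchAll yields one match per hit; index 1 is the only capture group.
const pictures = Array.from(str.matchAll(re), (match) => match[1]);
console.log(pictures);
```

Using a single capture group also means the value is always at index 1, whatever the spacing around the equals sign.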
Sandeep Sharma