I am trying to match every src attribute whose value ends in jpg, png, or gif, and to extract the src string itself. I am not sure the regex below is correct, but it does return src attributes together with their addresses. My question is twofold: what problems might this regex have, and how can I extract only the src string?

/src\s*=\s*(["'][^"']+(jpg|png|gif)\b)/g;
sawa
  • I'm voting to close this question as off-topic because it is asking for a code review. It might be on-topic (after some editing) on [this sister site](http://codereview.stackexchange.com/help/on-topic). – Quentin May 05 '16 at 09:23
  • Why are people voting down? Let me know so that I can rephrase my question? – sawa May 05 '16 at 09:24
  • I don't see the clear distinction between asking for a code review and asking about a programming problem that I am not well versed in. Can anyone explain so that I can understand? – sawa May 05 '16 at 09:28
  • You said the code works, so you don't have a problem. You just have doubts about the quality of the code. – Quentin May 05 '16 at 09:32
  • If you only look at the related questions on the right corner, there are bunch of questions in the similar format. I cannot accept your accusation without understanding the validity of your accusation myself. – sawa May 05 '16 at 09:33
  • On top of that, I am asking how I can extract only the src attribute out of the regex above. Who said I do not have a problem? – sawa May 05 '16 at 09:38

1 Answer

1

First of all, your regex is trying to do too much. Start by doing something like:

function img_find() {
    var imgs = document.getElementsByTagName("img");
    var imgSrcs = [];

    for (var i = 0; i < imgs.length; i++) {
        imgSrcs.push(imgs[i].src);
    }

    return imgSrcs;
}

Now, your regex has a lot less to deal with. (No whitespace, single vs double quotes, and so on.)

Please read this, and don't (except for very simple situations) try to use regex for parsing raw HTML :)

So, given an array of image sources, you just need to select the jpg/png/gif ones:

/\.(jpg|png|gif)$/i;
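As a rough sketch of that selection step, the snippet below filters an array of sources by extension. The sample URLs are made up for illustration; in a browser the array would come from img_find(). The pattern used here assumes a literal dot before the extension and an anchor at the end of the string.

```javascript
// Hypothetical sample input; in a browser this would be the result of img_find().
const imgSrcs = [
  "http://example.com/a.jpg",
  "http://example.com/b.PNG",
  "http://example.com/c.svg",
  "http://example.com/notapng",
];

// Require a literal dot before the extension and anchor at the end of the string.
// The i flag makes the match case-insensitive.
const imageExt = /\.(jpg|png|gif)$/i;

// Keep only the sources whose extension is jpg, png, or gif.
const wanted = imgSrcs.filter((src) => imageExt.test(src));

console.log(wanted);
```

Because the pattern has no g flag, calling test() repeatedly inside filter() carries no state from one string to the next.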

And then grab their file names, without the extension. (There are many ways of doing this; here's just one I've thrown together.)

/(.*)\.[^.]+$/;
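A minimal sketch of how a capture group could be read back out for this step. The helper name stripExtension is hypothetical, and the pattern assumes the input has already passed the jpg/png/gif filter, so an extension is always present.

```javascript
// Hypothetical helper; returns the input unchanged when the pattern does not match.
function stripExtension(src) {
  // Greedy (.*) runs up to the last dot; [^.]+$ is the extension itself.
  const match = /(.*)\.[^.]+$/.exec(src);
  // match[0] is the whole match, match[1] is the first capture group.
  return match ? match[1] : src;
}

const withoutExt = stripExtension("http://example.com/images/photo.jpg");
console.log(withoutExt);
```

Note that on a URL with no extension at all, the last dot may sit in the host name, so this pattern is only safe after the extension filter has been applied.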
Tom Lord
  • Thanks for your explanation. The reason I am using regex is that I am trying to collect image links that are embedded within javascript code, which I cannot parse for image tags. Since I am not familiar with regex syntax, I am still not sure how I can extract only the address part after matching 'src' in the beginning. I am sure that this has to do with the basics, but could you please explain this to me? – sawa May 05 '16 at 09:54
  • `javascript code cannot be parsed for image tags` -- Yes, it can. Regex is not the right answer to this problem, because of the aforementioned issues such as white-space, single vs double quotes, and so on. Use the DOM, as I suggested, to get the image source. *Then* use a regex. Any pure-regex solution will have annoying edge case bugs, and will be extremely difficult to read and understand. – Tom Lord May 05 '16 at 10:19
  • Or if you want to ignore my advice completely, then just use: `/src\s*=\s*(["']([^"']+)\.(jpg|png|gif)\b)/g;`, and the second match group will contain the file name. But as I keep saying, there are a hundred ways that this could go wrong... For example, what about a file named `thisisnota.png.exe`? Or, what if Unicode quotes (https://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html) are used? Or, what if the file is named `"file_with_a_'_character.png"`? None of these things would be a problem if you did it properly, as I suggested. – Tom Lord May 05 '16 at 10:25
  • I think I misled you by not clarifying myself. The javascript code I have does not have img tag itself. For example, I need to extract src from codes such as : ... when('A').register("ImageBlockATF", function(A){ var data = { 'colorImages': { 'initial': [{"hiRes":"http://ecx.images-zone.com/images/I/d15sdgL._wgg1500_.jpg" ... which I believe is not possible to extract without regex. – sawa May 05 '16 at 10:29
  • Thanks for your advice, though. I understand that it is not recommended to parse html with regex. For my use case, however, I will try the regex you gave me. – sawa May 05 '16 at 10:45
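For the case raised in the comments, where the image URLs sit inside JavaScript text rather than in img tags, the following is one possible sketch, not a robust parser. The sample string is modeled on the snippet quoted above; the "thumb" entry and the variable names are invented for illustration. The pattern drops the src prefix (the quoted data has none), uses a backreference so the closing quote must match the opening one, and reads the URL from the second capture group.

```javascript
// Hypothetical input modeled on the snippet in the comments; the 'thumb' entry is invented.
const script = `var data = { 'colorImages': { 'initial': [{"hiRes":"http://ecx.images-zone.com/images/I/d15sdgL._wgg1500_.jpg","thumb":'http://ecx.images-zone.com/images/I/t1.png'}]}};`;

// Group 1: the opening quote. Group 2: the URL, ending in .jpg, .png, or .gif.
// \1 requires the same quote character immediately after the extension.
const quotedImage = /(["'])([^"'\s]+\.(?:jpg|png|gif))\1/gi;

// matchAll needs the g flag; each match m exposes its capture groups as m[1], m[2], ...
const urls = Array.from(script.matchAll(quotedImage), (m) => m[2]);

console.log(urls);
```

Requiring the closing quote right after the extension means a name such as `thisisnota.png.exe` is not matched. URLs that themselves contain quote characters, or that are wrapped in Unicode quotes, are still missed, so the caveats from the comments apply.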