Using the official spec for the HTML5 srcset image candidate string, I've created the following regex:

/<img[^\>]*[^\>\S]+srcset=['"](?:([^"'\s,]+)\s*(?:\s+\d+[wx])(?:,\s*)?)+["']/gm

...Which should match the following tag:

<img srcset="image@2x.png 2x, image@4x.png 4x, image@6x.png 6x">

...And return the three filenames specified (image@2x.png, image@4x.png, image@6x.png).

However, even though it matches, it's only returning the last one. See this Regex101 demo.

What am I doing wrong?

aendra
  • Why don't you use html parsers? – Avinash Raj Oct 20 '14 at 13:24
  • @AvinashRaj It's part of a pull request I'm doing to `grunt-imagemin`, which (alas) uses regex. Insert link to that classic "WTF are you parsing HTML with regex?!" answer here... – aendra Oct 20 '14 at 13:36
  • Obligatory self-link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – aendra Oct 20 '14 at 16:28

1 Answer

As you can see in this visualization, the capture-group parentheses are inside a repeated pattern. Each repetition overwrites whatever the group captured on the previous pass, so the regex only returns the last filename.

<img[^\>]*[^\>\S]+srcset=['"](?:([^"'\s,]+)\s*(?:\s+\d+[wx])(?:,\s*)?)+["']

Regular expression visualization

Debuggex Demo
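
The overwriting behaviour described above can be reproduced in a few lines. This is only a minimal sketch, assuming a JavaScript engine such as Node.js or a browser console; the variable names (`demoTag`, `repeatedCaptureRe`, `demoMatch`) are made up for illustration, and the `g`/`m` flags are dropped because a single `exec` call does not need them.

```javascript
// The tag from the question.
const demoTag = '<img srcset="image@2x.png 2x, image@4x.png 4x, image@6x.png 6x">';

// The question's pattern: capture group 1 sits inside the repeated (?: ... )+ group.
const repeatedCaptureRe = /<img[^\>]*[^\>\S]+srcset=['"](?:([^"'\s,]+)\s*(?:\s+\d+[wx])(?:,\s*)?)+["']/;

const demoMatch = repeatedCaptureRe.exec(demoTag);

// Group 1 matched three times, but only the final iteration is kept.
console.log(demoMatch[1]); // "image@6x.png"
```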

JavaScript regexes can't return multiple instances of the same capture group; a repeated group keeps only its last match. What you need to do is capture the entire srcset value and then examine it further to get the individual filenames:

<img[^\>]*[^\>\S]+srcset=['"]((?:[^"'\s,]+\s*(?:\s+\d+[wx])(?:,\s*)?)+)["']

Regular expression visualization

Debuggex Demo
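
One possible way to do the "examine it further" step is sketched below. It assumes plain JavaScript with no libraries; the function name `extractSrcsetUrls` and the split-on-comma strategy are illustrative choices, not something taken from `grunt-imagemin` or from the spec. Like the regex above, it only handles integer `w`/`x` descriptors and URLs that contain no commas.

```javascript
// Hypothetical helper: returns every URL found in the srcset attributes of <img> tags.
function extractSrcsetUrls(html) {
  // Group 1 now wraps the whole candidate list instead of a single filename.
  const tagRe = /<img[^\>]*[^\>\S]+srcset=['"]((?:[^"'\s,]+\s*(?:\s+\d+[wx])(?:,\s*)?)+)["']/g;
  const urls = [];
  let tagMatch;

  // With the g flag, exec() walks through every matching <img> tag in turn.
  while ((tagMatch = tagRe.exec(html)) !== null) {
    // Second pass: split the captured list into "url descriptor" candidates.
    for (const candidate of tagMatch[1].split(/\s*,\s*/)) {
      // The URL is the first whitespace-separated token of each candidate.
      const url = candidate.trim().split(/\s+/)[0];
      if (url) urls.push(url);
    }
  }
  return urls;
}

const tag = '<img srcset="image@2x.png 2x, image@4x.png 4x, image@6x.png 6x">';
console.log(extractSrcsetUrls(tag));
// ["image@2x.png", "image@4x.png", "image@6x.png"]
```

Splitting in a second pass keeps the first regex responsible only for locating the attribute, which sidesteps the repeated-capture limitation entirely.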

asontu
  • Good answer. I realised JavaScript regex can't repeat capture groups after posting. Really, the solution is "Don't use regex to parse HTML you fools." – aendra Oct 20 '14 at 16:16