Deciphering a Regex

Question

Please can someone help me understand this regular expression used to match src attributes of img tags in HTML?

src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))


src=                               this is easy
(?:(['""])(?<src>(?:(?!\1).)*)     ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1                                 unknown
|                                  "or"
(?<src>[^\s>]+))                   named group "src" matches one or more of line start or whitespace

In brief what does ?: mean?

So (?:...) is a non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

Thanks @mbratch

what does \1 mean?

And finally, does the exclamation mark have any special significance here? (negation?)

Parsing the inside of a single tag with regex is actually probably fine — murgatroid99, May 29 '13 at 13:25
[http://stackoverflow.com/a/1732454/674700](http://stackoverflow.com/a/1732454/674700) — Alex Filipovici, May 29 '13 at 13:25
(?:...) is a non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern. — lurker, May 29 '13 at 13:25
@Alex that is the most famous post on here. I have seen it. I am aware! — Ben Aston, May 29 '13 at 13:26
`\1` (or in context, `(?!\1)`), I think, refers back to the match for `(['""])`, and in context means "anything but what was previously matched by `['""]`". — lurker, May 29 '13 at 13:39

score 3 · Answer 1 · edited Feb 08 '17 at 14:41

This may help you understand the regex.

(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))

[Image: Debuggex diagram of the regular expression]

answered May 29 '13 at 13:29

John Sobolewski

score 2 · Accepted Answer · answered May 29 '13 at 13:34

For an example, consider src="img.jpg" as the text we are parsing

In a regex, \1 refers to the first capturing group. In this particular case, the first capturing group is (['""]), which matches either quote character. The section (?:(['""])(?<src>(?:(?!\1).)*) is the start of a non-capturing group and matches "img.jpg in our example (the opening quote plus the value). Then (?!\1) is a negative lookahead for the quote character matched by the first group, so (?:(?!\1).) matches any single character that is not that quote character, and (?<src>(?:(?!\1).)*) captures, in the named group src, the sequence of characters that precedes the closing quote. The following \1 then matches the closing quote itself.
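A minimal Python sketch of the quoted branch described above, applied to the same src="img.jpg" text. Using Python's re module is an assumption (the original pattern is written for .NET), so the named group is spelled (?P<src>...) and only the quoted alternative is shown.

```python
import re

# Quoted branch only: opening quote (group 1), value (group "src"),
# then the same quote again via the backreference \1.
QUOTED_SRC = re.compile(r"""src=(['"])(?P<src>(?:(?!\1).)*)\1""")

match = QUOTED_SRC.search('<img src="img.jpg" alt="x">')
quote = match.group(1)        # the quote character that opened the value
value = match.group("src")    # everything up to the matching closing quote
print(quote, value)
```

Because the lookahead only rejects the quote that actually opened the value, a value such as 'a"b.jpg' (single-quoted, containing a double quote) is still captured whole.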

Martin Ender · Answer 3 · 2013-05-29T13:43:06.917

src=      # matches literal "src="
(?:       # the ?: suppresses capturing. generally a good practice if capturing
          # is not explicitly necessary
  (['"])  # matches either ' or ", and captures what was matched in group 1
          # (because this is the first set of parentheses where capturing is not
          # suppressed)
  (?<src> # start another (named) capturing group with the name "src"
    (?:   # start non-capturing group
      (?!\1)
          # a negative lookahead, if its contents match, the lookahead causes the
          # pattern to fail
          # the \1 is a backreference and matches what was matched in capturing
          # group no. 1
    .)*   # match any character, end of non-capturing group, repeat
          # summary of this non-capturing group: for each character, check that
          # it is not the kind of quote we matched at the start. if it's not,
          # then consume it. repeat as long as possible.

  )       # end of capturing group "src"
  \1      # again a backreference to what was matched inside capturing group 1
          # i.e. match the same kind of quote that started the attribute value
|         # or
  (?<src> # again a capturing group with the name "src"
    [^\s>]+
          # match as many non-space, non-> characters as possible (at least one)
  )       # end of capturing group. this case treats unquoted attribute values.
)         # end of non-capturing group (which was used to group the alternation)
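The commented pattern above can be tried out with a small Python sketch. This is an adaptation under stated assumptions, not the original .NET pattern: Python's re module does not accept two groups with the same name, so the two src groups are given the hypothetical names quoted and unquoted, and extract_src is a hypothetical helper that returns whichever one took part in the match.

```python
import re

# Python's re rejects duplicate group names, so the two "src" groups of the
# .NET pattern are renamed "quoted" and "unquoted" (hypothetical names).
SRC_ATTR = re.compile(r"""
    src=
    (?:
        (['"])                      # group 1: the opening quote
        (?P<quoted>(?:(?!\1).)*)    # characters that are not that quote
        \1                          # the matching closing quote
      |
        (?P<unquoted>[^\s>]+)       # unquoted value: no whitespace, no >
    )
""", re.VERBOSE)


def extract_src(tag):
    """Return the src value found in tag, or None if there is no src attribute."""
    m = SRC_ATTR.search(tag)
    if m is None:
        return None
    quoted = m.group("quoted")
    # An empty quoted value ("") is a real match, so test against None.
    return quoted if quoted is not None else m.group("unquoted")


print(extract_src('<img src="a b.png">'))
print(extract_src("<img src=logo.png>"))
```

In .NET the renaming is unnecessary, since both alternatives can write to the same named group.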

Some further reading for you:

If you want to refresh your regex knowledge a bit, I recommend reading through the entire tutorial. It's definitely worth your time.

A few more resources to get help with understanding complicated expressions:

Regex 101 generates an explanation from a regex. However, it uses PHP's PCRE engine, so it will choke on some of the .NET features like repeated named capturing groups (in your case the src).
Debuggex, which lets you step through the regex and generates a flowchart. So far its regex flavor is even more limited, though (to JavaScript's ECMAScript flavor).
Regexper, which focuses on the flowchart thing. As of now it's also limited to JavaScript's regex flavor, though.

Anirudha · Answer 4 · 2013-05-29T15:14:54.643

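1>It first captures any one of ['""] in group 1, i.e. (['""])

2>Then it matches zero or more characters that are not the one captured in group 1, i.e. (?:(?!\1).)*

3>It repeats step 2 until it reaches the character captured in group 1, i.e. \1

The above 3 steps are conceptually similar to (['""])[^\1]*\1 (read this as shorthand only: a backreference has no special meaning inside a character class, which is why the pattern uses a lookahead instead)

OR

1>It matches all the non-space, non-> characters after src=, i.e. [^\s>]+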
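The [^\1] form used in the comparison above is best read as pseudo-notation. A short Python sketch shows why; using Python's re module is an assumption here, and in that module \1 inside [...] is parsed as the character with code 1 rather than as a reference to group 1.

```python
import re

text = 'src="a.png" alt="b"'

# Inside [...] the \1 is the character with code 1, not "whatever group 1
# matched", so the class accepts quote characters and runs past the value.
char_class = re.search(r"""src=(['"])[^\1]*\1""", text).group(0)

# The lookahead version checks each character against the opening quote.
lookahead = re.search(r"""src=(['"])(?:(?!\1).)*\1""", text).group(0)

print(char_class)
print(lookahead)
```

The character-class version only stops at the last quote in the text, while the lookahead version stops at the first closing quote.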

NOTE I would use src=(['""]).*?\1

.* is greedy: it matches as much as it can.

.*? is lazy: it matches as little as it can.

For example, consider the string hello hi world

For the regex ^h.*l the output would be hello hi worl

For the regex ^h.*?l the output would be hel
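The greedy and lazy results above can be reproduced with a few lines of Python (an assumption: Python's re module, which treats .* and .*? the same way for this example).

```python
import re

text = "hello hi world"

# Greedy: .* grabs as much as possible, then backs off to the last "l".
greedy = re.match(r"^h.*l", text).group(0)

# Lazy: .*? grabs as little as possible, so it stops at the first "l".
lazy = re.match(r"^h.*?l", text).group(0)

print(greedy)  # hello hi worl
print(lazy)    # hel
```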

edited May 29 '13 at 15:14

answered May 29 '13 at 13:29

Please can you explain the `*?` part of your suggested regex? – Ben Aston May 29 '13 at 14:30

score 1 · Answer 5 · answered May 29 '13 at 13:29

I used RegexBuddy to get this output:

Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
   Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
      Match the regular expression below and capture its match into backreference number 1 «(['""])»
         Match a single character present in the list “'"” «['""]»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
         Match the regular expression below «(?:(?!\1).)*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
               Match the same text as most recently matched by capturing group number 1 «\1»
            Match any single character that is not a line break character «.»
      Match the same text as most recently matched by capturing group number 1 «\1»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
         Match a single character NOT present in the list below «[^\s>]+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
            A whitespace character (spaces, tabs, line breaks, etc.) «\s»
            The character “>” «>»

This regex is a poor choice for what you described: src=" (an opening quote with no value or closing quote) is accepted as valid input.
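The point about src=" can be checked with a short Python sketch. It is an adaptation under assumptions: Python's re module is used instead of .NET, and the two src groups are renamed to the hypothetical quoted and unquoted because Python does not allow duplicate group names.

```python
import re

PATTERN = re.compile(
    r"""src=(?:(['"])(?P<quoted>(?:(?!\1).)*)\1|(?P<unquoted>[^\s>]+))"""
)

# The quoted branch fails (there is no closing quote), so the engine falls
# back to the unquoted branch, which accepts the lone quote character.
m = PATTERN.search('<img src=">')
whole = m.group(0)
quoted = m.group("quoted")
unquoted = m.group("unquoted")
print(repr(whole), repr(quoted), repr(unquoted))
```

The match succeeds through the second alternative, so the captured "value" is just the stray quote character.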
