0

Please can someone help me understand this regular expression used to match src attributes of img tags in HTML?

src=(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))


src=                               this is easy
(?:(['""])(?<src>(?:(?!\1).)*)     ?: is unknown (['""]) matches either single or double quotes, followed by a named group "src" that matches unknown strings
\1                                 unknown
|                                  "or"
(?<src>[^\s>]+))                   named group "src" matches one or more of line start or whitespace

In brief, what does ?: mean?

So (?:...) is a non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after the match or referenced later in the pattern.

Thanks @mbratch

What does \1 mean?

And finally, does the exclamation mark have any special significance here? (negation?)

Anirudha
Ben Aston

5 Answers

3

This may help you understand the regex.

(?:(['""])((?:(?!\1).)*)\1|([^\s>]+))

[Image: diagram of the regular expression, generated with Debuggex]

Community
John Sobolewski
2

For an example, consider src="img.jpg" as the text we are parsing.

In a regex, \1 is a backreference to the first capturing group, which in this case is (['""]). The first alternative inside the non-capturing group, (['""])(?<src>(?:(?!\1).)*)\1, matches "img.jpg" in our example, piece by piece. (['""]) matches the opening quote character, either single or double. (?!\1) is a negative lookahead for whichever quote was matched by the first group, so (?:(?!\1).) matches any single character that is not that quote. (?<src>(?:(?!\1).)*) therefore captures, in the named group src, the run of characters up to the closing quote (img.jpg here). The final \1 then matches the closing quote, which must be the same kind as the opening one.
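The walkthrough above can be tried out with a small Python sketch. This is an adaptation, not the original .NET pattern: Python's re module does not accept two groups with the same name, so the names src_quoted and src_bare are assumptions made for illustration, as is the helper extract_src. The doubled "" from the C# verbatim string is written as a single ".

```python
import re

# Adapted from the .NET pattern in the question. The group names
# src_quoted and src_bare are hypothetical stand-ins for the repeated
# "src" name, which Python's re module would reject.
PATTERN = re.compile(
    r"""src=(?:(['"])(?P<src_quoted>(?:(?!\1).)*)\1|(?P<src_bare>[^\s>]+))"""
)


def extract_src(text):
    """Return the src value of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Only one of the two named groups takes part in any given match.
    if m.group("src_quoted") is not None:
        return m.group("src_quoted")
    return m.group("src_bare")


m = PATTERN.search('<img src="img.jpg">')
print(m.group(1))                              # the quote captured by group 1
print(extract_src('<img src="img.jpg">'))      # value between double quotes
print(extract_src("<img src='it\"s.png'>"))    # other quote kind allowed inside
print(extract_src("<img src=img.jpg>"))        # unquoted value
```

The third call shows why the lookahead checks against \1 rather than against any quote: a double quote inside a single-quoted value is consumed like any other character.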

murgatroid99
2
src=      # matches literal "src="
(?:       # the ?: suppresses capturing. generally a good practice if capturing
          # is not explicitly necessary
  (['"])  # matches either ' or ", and captures what was matched in group 1
          # (because this is the first set of parentheses where capturing is not
          # suppressed)
  (?<src> # start another (named) capturing group with the name "src"
    (?:   # start non-capturing group
      (?!\1)
          # a negative lookahead, if its contents match, the lookahead causes the
          # pattern to fail
          # the \1 is a backreference and matches what was matched in capturing
          # group no. 1
    .)*   # match any character, end of non-capturing group, repeat
          # summary of this non-capturing group: for each character, check that
          # it is not the kind of quote we matched at the start. if it's not,
          # then consume it. repeat as long as possible.

  )       # end of capturing group "src"
  \1      # again a backreference to what was matched inside capturing group 1
          # i.e. match the same kind of quote that started the attribute value
|         # or
  (?<src> # again a capturing group with the name "src"
    [^\s>]+
          # match as many non-whitespace, non-> characters as possible (at least one)
  )       # end of capturing group. this case treats unquoted attribute values.
)         # end of non-capturing group (which was used to group the alternation)
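The annotated layout above can be kept almost verbatim in a runnable form by using a free-spacing mode. The following is a sketch in Python with re.VERBOSE, under two assumptions: the two src groups are renamed (src_quoted and src_bare are hypothetical names, since Python rejects a repeated group name), and src_of is a helper invented for this example.

```python
import re

ANNOTATED = re.compile(
    r"""
    src=                  # literal "src="
    (?:                   # non-capturing group around the alternation
      (['"])              # group 1: the opening quote
      (?P<src_quoted>     # hypothetical name; Python forbids reusing "src"
        (?:(?!\1).)*      # any characters that are not the opening quote
      )
      \1                  # closing quote, same kind as the opening one
    |
      (?P<src_bare>       # hypothetical name for the unquoted case
        [^\s>]+           # one or more non-whitespace, non-> characters
      )
    )
    """,
    re.VERBOSE,
)


def src_of(tag):
    """Return the captured src value from either alternative, or None."""
    m = ANNOTATED.search(tag)
    if m is None:
        return None
    quoted = m.group("src_quoted")
    return quoted if quoted is not None else m.group("src_bare")


for tag in ('<img src="a.png">', "<img src='b.png'>", "<img src=c.png>"):
    print(src_of(tag))
```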

Some further reading for you:

If you want to refresh your regex knowledge a bit, I recommend reading through the entire tutorial. It's definitely worth your time.

A few more resources to get help with understanding complicated expressions:

  • Regex 101 generates an explanation from a regex. However, it uses PHP's PCRE engine, so it will choke on some of the .NET features like repeated named capturing groups (in your case the src).
  • Debuggex, which lets you step through the regex and generates a flowchart. So far its regex flavor is even more limited, though (to JavaScript's ECMAScript flavor).
  • Regexper, which focuses on the flowchart thing. As of now it's also limited to the JavaScript regex flavor, though.
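The repeated named group is also what trips up Python's built-in re module, so it is worth checking before porting the pattern to another flavor. A minimal sketch (the helper name compiles and the group names src_a and src_b are made up for this example):

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Same name in both alternatives, as in the .NET pattern: rejected.
print(compiles(r"(?P<src>a)|(?P<src>b)"))

# Distinct (hypothetical) names for the two alternatives: accepted.
print(compiles(r"(?P<src_a>a)|(?P<src_b>b)"))
```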
Martin Ender
1

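1> It first captures either ' or " in group 1, i.e. (['""])

2> Then it matches zero or more characters that are not the one captured in group 1, i.e. (?:(?!\1).)*

3> It repeats step 2 until it reaches the character captured in group 1, i.e. \1

Conceptually, the three steps above behave like (['""])[^\1]*\1, but that is only shorthand: inside a character class \1 is not treated as a backreference, so the lookahead form is needed in a real pattern.

OR

1> It matches all characters after src= that are neither whitespace nor >, i.e. [^\s>]+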


NOTE: I would use src=(['""]).*?\1

.* is greedy: it matches as much as it can.

.*? is lazy: it matches as little as it can.

For example, consider the string hello hi world

For the regex ^h.*l the match would be hello hi worl

For the regex ^h.*?l the match would be hel
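The greedy and lazy behaviour can be checked directly. This is a small Python sketch; the sample tag and the second capturing group around .* and .*? are additions made here so that the matched value can be printed.

```python
import re

text = "hello hi world"
greedy = re.match(r"^h.*l", text).group()
lazy = re.match(r"^h.*?l", text).group()
print(greedy)  # hello hi worl
print(lazy)    # hel

# The same difference on an attribute value. The tag is a made-up sample.
tag = '<img src="a.png" alt="x">'
lazy_src = re.search(r"""src=(['"])(.*?)\1""", tag).group(2)
greedy_src = re.search(r"""src=(['"])(.*)\1""", tag).group(2)
print(lazy_src)    # stops at the first closing quote
print(greedy_src)  # runs on to the last matching quote in the tag
```

With more than one quoted attribute in the tag, only the lazy form stops at the end of the src value.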

Anirudha
1

I used RegexBuddy to get this output:

Match the characters “src=” literally «src=»
Match the regular expression below «(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))»
   Match either the regular expression below (attempting the next alternative only if this one fails) «(['""])(?<src>(?:(?!\1).)*)\1»
      Match the regular expression below and capture its match into backreference number 1 «(['""])»
         Match a single character present in the list “'"” «['""]»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>(?:(?!\1).)*)»
         Match the regular expression below «(?:(?!\1).)*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!\1)»
               Match the same text as most recently matched by capturing group number 1 «\1»
            Match any single character that is not a line break character «.»
      Match the same text as most recently matched by capturing group number 1 «\1»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «(?<src>[^\s>]+)»
      Match the regular expression below and capture its match into backreference with name “src” «(?<src>[^\s>]+)»
         Match a single character NOT present in the list below «[^\s>]+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
            A whitespace character (spaces, tabs, line breaks, etc.) «\s»
            The character “>” «>»

This regex is a poor fit for what you described: src=" is accepted as valid input, because the unquoted alternative [^\s>]+ will match the lone quote character.
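The objection can be reproduced with a Python adaptation of the pattern. As an assumption for this sketch, the two src groups are renamed to src_quoted and src_bare, because Python does not accept a repeated group name.

```python
import re

# Python adaptation of the pattern; group names are hypothetical.
pattern = re.compile(
    r"""src=(?:(['"])(?P<src_quoted>(?:(?!\1).)*)\1|(?P<src_bare>[^\s>]+))"""
)

# A tag whose src attribute has an opening quote and nothing else.
m = pattern.search('<img src=">')

# The quoted alternative fails (there is no closing quote), so the regex
# falls back to the unquoted alternative and captures the quote itself.
print(m is not None)
print(repr(m.group("src_quoted")))
print(repr(m.group("src_bare")))
```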

Matan Shahar