Can you explain how this works? Here is an example:

<!-- The quick brown fox 
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" />
<![endif]-->

<!-- Pack my box with five dozen liquor jugs -->

First, I tried to use the following regular expression to match the content inside conditional comments:

/<!--.*?stylesheet.*?-->/s

It failed: the match starts at the first <!-- and runs through the end of the conditional comment, so it also swallows the comment before it. Then I tried another pattern with a lookahead assertion:

/<!--(?=.*?stylesheet).*?-->/s

It works and matches exactly what I need. However, the following regular expression works as well:

/<!--(?=.*stylesheet).*?-->/s

The last regular expression does not have a reluctant quantifier in the lookahead assertion, and now I am confused. Can anyone explain how it works? Is there perhaps a better solution for this example?

Updated:

I tried using the regular expressions with the lookahead assertion on another document, and they failed to match the content between the comments. So this one, /<!--(?=.*?stylesheet).*?-->/s (as well as this one, /<!--(?=.*stylesheet).*?-->/s), is not correct. Do not use it; try the other suggestions instead.
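A minimal sketch of why both attempts can go wrong, using Python's `re` module instead of PHP's PCRE (the patterns behave the same way here). The two-comment document is a hypothetical test input, not the one from the post. The lookahead only asserts that stylesheet occurs somewhere later; it does not stop the following .*?--> from ending at the first -->.

```python
import re

# Hypothetical two-comment document, made up for this illustration.
doc = (
    "<!-- plain comment -->\n"
    "<!--[if IE 7]>\n"
    '    <link rel="stylesheet" href="/supersheet.css" />\n'
    "<![endif]-->\n"
)

# Lazy pattern without a lookahead: the match starts at the first <!--
# and keeps expanding until "stylesheet" and then a --> are found.
plain = re.search(r"<!--.*?stylesheet.*?-->", doc, re.S)

# Lookahead variant: the assertion consumes nothing, so .*?--> is free
# to stop at the very first --> after the opening <!--.
guarded = re.search(r"<!--(?=.*?stylesheet).*?-->", doc, re.S)

print(repr(plain.group(0)))    # spans both comments
print(repr(guarded.group(0)))  # only the plain comment
```

In this input neither pattern isolates the conditional comment, which is consistent with the failure described above.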

Updated:

The solution has been found by Jonny 5 (see the answer). He suggested three options:

  1. Using a negated hyphen to limit the match. This option works only if there is no hyphen between <!-- and stylesheet. If a hyphenated URL such as /style-sheet.css comes before the word stylesheet, it will not work.
  2. Using the escape sequence \K. It works like a charm. The downsides are the following:
    • It is terribly slow (in my case, it was 8-10 times slower than the other solutions)
    • Only available since PHP 5.2.4
  3. Using a lookahead to narrow the match. This is what I was trying to achieve, but my experience with lookaround assertions was insufficient for the task.

I think the following is a good solution for my example:

/(?s)<!--(?:(?!<!).)+?stylesheet.+?-->/

The same but with the s modifier at the end:

/<!--(?:(?!<!).)+?stylesheet.+?-->/s

As I said, this is a good solution, but I managed to improve the pattern and found another one that in my case works faster.

So, the final solution is the following:

/<!--(?:(?!-->).)+?stylesheet.+?-->/s
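As a quick check, here is a sketch that runs both variants against the example document from the top of the post. It uses Python's `re` rather than PHP, and the labels in the `patterns` dictionary are made up for this illustration.

```python
import re

# The example document from the question.
doc = (
    "<!-- The quick brown fox\n"
    "              jumps over the lazy dog -->\n"
    "\n"
    "<!--[if IE 7]>\n"
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />\n'
    "<![endif]-->\n"
    "\n"
    "<!-- Pack my box with five dozen liquor jugs -->\n"
)

# Each pattern consumes one character at a time, but only while the
# negative lookahead confirms that no delimiter starts at that position.
patterns = {
    "stop at <!": r"<!--(?:(?!<!).)+?stylesheet.+?-->",
    "stop at -->": r"<!--(?:(?!-->).)+?stylesheet.+?-->",
}

results = {}
for label, pattern in patterns.items():
    match = re.search(pattern, doc, re.S)
    results[label] = match.group(0) if match else None
    print(label, "->", repr(results[label]))
```

On this input both variants return only the conditional comment. The first <!-- is rejected because a delimiter (<! or -->) appears before stylesheet is reached.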

Thanks to all the participants for the interesting answers.

El cero
  • If you want to match *` [see regex101](https://regex101.com/r/aG2lZ0/1). Or put a greedy dot that eats up before and [\K](http://www.rexegg.com/regex-php.html#K) reset after [.*\K](https://regex101.com/r/lZ5pX3/1) or use a capture group. – Jonny 5 Aug 16 '15 at 05:11
  • Could you please expand your answer? I think it would be interesting for others, and for me as well. – El cero Aug 16 '15 at 06:01

2 Answers

The string stylesheet is mentioned only one time in your test document, so both regular expressions you tried will match the same thing but in different ways.

/<!--(?=.*?stylesheet).*?-->/s

This one does the following:

  • Match <!--.
  • Look ahead, lazily scanning characters up to and including the first stylesheet, without consuming them. Fail if not found.
  • Match characters up to and including -->.

/<!--(?=.*stylesheet).*?-->/s

This one does the following:

  • Match <!--.
  • Look ahead, scanning any character until no longer possible. Backtrack, continuously trying to match stylesheet. Fail if not found.
  • Match characters up to and including -->.

Basically, the greedy lookahead needs to backtrack significantly, while the lazy one doesn't.

If your subject instead is...

<!-- The quick brown fox 
              jumps over the lazy dog -->

<!--[if IE 7]>
    <link rel="stylesheet" type="text/css" href="/supersheet.css" /> <![endif]-->

<!-- Pack my box with five dozen stylesheets -->

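the two lookaheads land on different occurrences. The former finds the first stylesheet, while the latter finds the second (and last), since the greedy .* runs to the end of the string and backtracks from there. Because a lookahead consumes nothing, the overall match is still the same in both cases.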
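To make the difference visible, here is a small sketch in Python's `re` (not PHP) that wraps the lookahead body in a capture group. The extra group and the shortened document are additions for illustration only and are not part of the original patterns.

```python
import re

# Shortened stand-in for the subject above; "stylesheet" occurs twice.
doc = (
    "<!-- first -->\n"
    '<!--[if IE 7]><link rel="stylesheet" href="/supersheet.css" /><![endif]-->\n'
    "<!-- Pack my box with five dozen stylesheets -->\n"
)

# Group 1 records how far each lookahead scanned.
lazy = re.search(r"<!--(?=(.*?stylesheet)).*?-->", doc, re.S)
greedy = re.search(r"<!--(?=(.*stylesheet)).*?-->", doc, re.S)

first_end = doc.index("stylesheet") + len("stylesheet")
last_end = doc.rindex("stylesheet") + len("stylesheet")

print(lazy.end(1) == first_end)             # lazy stops at the first occurrence
print(greedy.end(1) == last_end)            # greedy ends at the last occurrence
print(lazy.group(0) == greedy.group(0))     # the overall match is identical
```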

Anonymous
  • Very interesting answer! Thank you, @Anonymous! The solution has been found. Please, see the updated post. – El cero Aug 17 '15 at 00:21

To match only the part <!--...stylesheet...--> there are many ways:

1.) Use a negated hyphen [^-] to limit the match and stay in between <!-- and stylesheet

(?s)<!--[^-]+stylesheet.+?-->

[^-] allows only characters that are not a hyphen. See test at regex101.
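A short sketch of this option and its limitation, written for Python's `re` rather than PHP. Both input strings are hypothetical: in the second one a hyphenated URL appears before the word stylesheet, which [^-]+ cannot cross.

```python
import re

pattern = re.compile(r"(?s)<!--[^-]+stylesheet.+?-->")

# Hypothetical inputs, made up for this illustration.
ok = '<!-- note --><!--[if IE 7]><link rel="stylesheet" href="/a.css" /><![endif]-->'
hyphen_first = '<!--[if IE 7]><link href="/my-style.css" rel="stylesheet" /><![endif]-->'

found = pattern.search(ok)             # skips the plain comment
missed = pattern.search(hyphen_first)  # blocked by the hyphen in the URL

print(repr(found.group(0)))
print(missed)
```

A hyphen that comes after stylesheet is not a problem, because that part is matched by .+? rather than [^-]+.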


2.) To get the "last" or closest match without much regex effort, you can also put a greedy dot in front to ᗧ eat up everything before it. This makes sense when not matching globally, i.e. when there is only one item to match. Use \K to reset the start of the match after the greedy part:

(?s)^.*\K<!--.+?stylesheet.+?-->

See test at regex101. You can also use a capture group and grab $1: (?s)^.*(<!--.+?stylesheet.+?-->)
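The capture group variant can be tried outside PHP as well. The sketch below uses Python's standard `re` module, which does not support \K, so only the group form is shown. The document is the example from the question.

```python
import re

doc = (
    "<!-- The quick brown fox\n"
    "              jumps over the lazy dog -->\n"
    "\n"
    "<!--[if IE 7]>\n"
    '    <link rel="stylesheet" type="text/css" href="/supersheet.css" />\n'
    "<![endif]-->\n"
    "\n"
    "<!-- Pack my box with five dozen liquor jugs -->\n"
)

# ^.* first runs to the end of the input, then backtracks until the
# group can match, so the last <!-- that leads to "stylesheet" wins.
match = re.search(r"(?s)^.*(<!--.+?stylesheet.+?-->)", doc)
closest = match.group(1) if match else None

print(repr(closest))
```

Since the greedy prefix swallows everything before the last candidate, this finds at most one item per input.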


3.) Using a lookahead to narrow it down is usually more costly:

(?s)<!--(?:(?!<!).)+?stylesheet.+?-->

See test at regex101. (?!<!). checks at each character between <!-- and stylesheet that no other <!... starts there, so the match stays inside one element. This is similar to the negated hyphen solution.


Instead of .* I used .+ (one or more). Which one to use depends on what should be matched; here + fits better.
Which solution to use depends on the exact requirements. For this case I would use the first.

Jonny 5
  • @Mr.twister Welcome! I read your update saying that the second solution is slower for your input. Actually, it should be pretty fast, but that depends on the input. I edited the second solution and added a `^` start [anchor](http://www.regular-expressions.info/anchors.html) to avoid unnecessary backtracking. In many cases the lookahead solution would be slower. Your modification to `(?!-->).` is fine, and more specific. – Jonny 5 Aug 17 '15 at 03:27