
I need a regex that matches a specific capturing group which falls inside a multiline comment /* ... */.

In particular, I need to find PHP variable definitions inside multiline comments.

For example:

/* other code $var = value1 */
$var = value2 ;

/* 
other code
$var = value3 ;
other code
*/

The regex must match only the two occurrences of '$var =' inside the comments, not the one outside the comment.

For the above example I wrote a regex that uses unrestricted lookbehind, like this:

(?<=[/][\*][^/]+)(\$var) | (?<=[/][\*][^\*]+)(\$var)

But this regex fails whenever both characters * and / appear between the comment opening tag '/*' and $var, even if they are APART from one another, which is not the desired behaviour.

For example, it fails in this case:

$var = .... ;

/* 
other * code /
$var = .... ;
other code
*/

because it finds both '*' and '/', even though together they don't form the comment closing tag.

The key point is that I cannot negate a token that is a combination of two characters; I can only negate them one by one: [^*] or [^/].

Furthermore, I cannot use the token [\s\S] instead of [^/] and [^*], because it would also select a $var outside of comments whenever a comment block precedes it.

Any ideas? Is it even possible to achieve this with a normal regex? Or would I need something different?

Obomar
  • How about using [`\G` like in this demo at regex101](https://regex101.com/r/eO9fU4/1). – bobble bubble Jun 10 '16 at 13:52
  • Thank you! This regex answers the question. Using the meta character \G works nicely! The only problem is that it's a little hard to understand for beginners... I understand why it's used in this case, but I'm still not exactly comfortable with the general meaning of (?!^) – Obomar Jun 10 '16 at 16:19
  • Great it helps. I put an answer with some explanation. – bobble bubble Jun 10 '16 at 18:41

5 Answers


This matches just $var, and only inside a multiline comment:

(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)

DEMO

(?:(?!/\*|\*/).)* is a captive lookahead (also known as a Tempered Greedy Token--good name, but too many syllables), and it's how you exclude a sequence, as opposed to a single character. This one matches zero or more of any character (including newline, because of the (?s)), as long as it's not the first character of /* or */.

The enclosing lookahead succeeds if it finds */ without first encountering /*. That means the current position must be inside a comment (there's no need to match the opening /*). And because the lookahead doesn't consume any characters, you can match more than one item per comment, if you need to.
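
As a minimal sketch, the regex above can be run with preg_match_all against the sample input from the question. The variable names ($source, $pattern, $lines) and the line-number reporting are assumptions added for illustration; they are not part of the original answer.

```php
<?php
// Sample input from the question (nowdoc, so $var is not interpolated).
$source = <<<'TXT'
/* other code $var = value1 */
$var = value2 ;

/*
other code
$var = value3 ;
other code
*/
TXT;

// Match $var only if a */ follows with no /* or */ in between.
$pattern = '~(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)~';

preg_match_all($pattern, $source, $matches, PREG_OFFSET_CAPTURE);

// Convert each match offset into a 1-based line number.
$lines = [];
foreach ($matches[0] as [$text, $offset]) {
    $lines[] = substr_count(substr($source, 0, $offset), "\n") + 1;
}

print_r($lines); // lines 1 and 6: the two $var inside comments
```

The $var on line 2 is skipped because the lookahead runs into the next /* before it can reach a */.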

One thing that can fool this regex is a */ that's not really a comment closer. So these:

$var = "*/";

$var = ...;
// */

... would match, even though they're not in a comment.
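
A small sketch of those two false positives, assuming the same regex as above; the variable names are made up for the illustration.

```php
<?php
$pattern = '~(?s)\$var(?=(?:(?!/\*|\*/).)*\*/)~';

// Neither input contains a real multiline comment, yet each contains a "*/".
$inString      = '$var = "*/";';
$inLineComment = "\$var = ...;\n// */";

// preg_match_all returns the number of full matches.
$hitsInString      = preg_match_all($pattern, $inString);
$hitsInLineComment = preg_match_all($pattern, $inLineComment);

var_dump($hitsInString, $hitsInLineComment); // int(1) for both
```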

Alan Moore
  • You are right, your regex works as expected except for the cases you mentioned, and I would add another case: it doesn't match $var in /* $var /* code... */ – Obomar Jun 10 '16 at 16:12
  • ...and I'm sure we could come up with more ways for it to fail. As @Toto said elsewhere, to do this right, you need an actual parser. In fact, if I had thought you needed to match the values being *assigned* to `$var`, I wouldn't even have tried. – Alan Moore Jun 10 '16 at 18:51
  • Thank you @AlanMoore for your answer. From my point of view your regex is the most elegant and understandable, but unfortunately I need to handle matches that sit between possible opening delimiters, otherwise I would use your regex. I should've mentioned it, sorry. Luckily, for the purpose of this question I don't need a parser to check for quoted delimiters or to manage nested comments (and treat them as such), because in my case quoted delimiters are absent or very rare. I just need to replicate the most common multiline comment behavior you see in programming text editors :) – Obomar Jun 11 '16 at 17:00

How about:

$str = '
/* other code */
$var = "var1";

/* 
other code
$var = "var2";
other code
*/
/* other code */
$var = "var3";

/* 
other code / <-- a slash here
$var = "var4";
other code
*/';

preg_match_all('~/\*(?:(?!\*/).)+?(\$var = .+?;).*?\*/~s', $str, $m);
print_r($m[1]);

Output:

Array
(
    [0] => $var = "var2";
    [1] => $var = "var4";
)
Toto
  • Negative lookahead is just `(?!`, not `(?!=`. Also, the enclosing group has to consume only one character at a time. As it is, your regex is only working by accident. – Alan Moore Jun 10 '16 at 14:25
  • your solution works with the example you provided but seems to fail in a more general scenario like the one proposed by @AlanMoore – Obomar Jun 10 '16 at 16:05
  • @Obomar: Yes, it will fail in some cases. If you want to pass every case, you have to write a parser. – Toto Jun 10 '16 at 16:09

An idea: use \G to glue matches to /*

(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*

Might be hard to understand if you aren't doing a lot with regexes. See regex101 for demo.

\G can be seen as "glue": it continues at the end of the previous match. But \G also matches at the start of the string. That's why the negative lookahead is added: \G(?!^) is only meant to continue a match, never to start one at the beginning of the string.
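
To see the effect of \G and of (?!^) in isolation, here is a tiny sketch on a made-up subject string. The string '12a34' and the literal "a" that stands in for the comment opener are assumptions chosen only for the illustration.

```php
<?php
$subject = '12a34';

// \G alone: the first attempt is anchored at offset 0, each later
// attempt at the end of the previous match. The chain breaks at "a".
preg_match_all('~\G\d~', $subject, $chained);

// With (?!^), a chain can no longer start at the beginning of the string;
// it has to be started by the other alternative (here a literal "a").
preg_match_all('~(?:a|\G(?!^))\d~', $subject, $glued);

print_r($chained[0]); // "1", "2"
print_r($glued[0]);   // "a3", "4"
```

In the second pattern the leading digits are skipped, and only digits glued to the "a" are reported.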

  • /\*|\G(?!^) This part finds the beginning of a match at /* or continues from the previous match.

  • (?:(?!\*/)[^$])* Matches any amount of characters that are not $ (negated class), as long as the comment does not end there (?!\*/). This covers the stuff before/between $var.

  • \K\$var \K resets the beginning of the reported match, so that it starts at $var. \K can be useful as an alternative to a variable-width lookbehind, which is not available in PCRE.

  • \s*=\s*(?:(?!\*/)[^$;])* Matches the value of the variable. This is far from perfect and would need modification for quoted values or if it is not convenient for your input. After = it matches [^$;] characters, i.e. anything that is not a dollar sign or semicolon, as long as there's no */ ahead (?!\*/).

This regex does not check whether there is actually a comment end */; it just binds matches to /*.
Another idea would be to use a trick with the verbs (*SKIP)(*FAIL), like in this demo.
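
A minimal sketch of the \G regex in PHP, assuming a made-up input with one $var before the comment, two inside it and one after it. The names $source, $pattern and $found are illustrative only.

```php
<?php
// Nowdoc, so $var is not interpolated.
$source = <<<'TXT'
$var = value0 ;
/* other code $var = value1 ; more
$var = value2 ;
*/
$var = value3 ;
TXT;

$pattern = '~(?:/\*|\G(?!^))(?:(?!\*/)[^$])*\K\$var\s*=\s*(?:(?!\*/)[^$;])*~';

preg_match_all($pattern, $source, $matches);

// The value part also swallows the blank before ";", so trim it for display.
$found = array_map('trim', $matches[0]);

print_r($found); // "$var = value1", "$var = value2"
```

The assignments before and after the comment are not reported, because a chain can only start at /* and stops at the first */.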

bobble bubble

Something like this might work:

/\/\*.*?\$var\s*\=\s(.*?)(?=\s*;)/s

Usage:

$str = '$var = .... ;
/*
other code
$var = ..... ;
other code
*/';
preg_match('/\/\*.*?\$var\s*\=\s(.*?)(?=\s*;)/s', $str, $matches);

var_dump($matches);

Will output:

array(2) {
  [0]=>
  string(26) "/*
other code
$var = ....."
  [1]=>
  string(5) "....."
}

Your string is then stored in $matches[1].

Try it online

Andreas Louv
  • Thank you, but unfortunately your solution matches the comments too and doesn't seem to handle the case of a preceding comment block... – Obomar Jun 10 '16 at 13:10
  • @Obomar the match is captured in match group `1` not `0`: `$matches[1]` – Andreas Louv Jun 10 '16 at 13:11
  • Right, your solution conceptually does work for the example I provided, but that example is not complete (sorry for that). I updated the question and changed the example to a more general scenario in which there is more than one block of multiline comments: consider /* */ $var /* code.. $var ..code.. */. Would it still work? It seems it would match the $var outside comments too. – Obomar Jun 10 '16 at 13:37

Try this in PHP; it works in Java:

(?s)(?i)(^|\s+?)(/*)((.)(?!*/))?(this)(.?)(*/)

In this example, the word to find is "this".

Maneskin