
I have a regular expression as follows:

(\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/)|(\/\/.*)

And my test string as follows:

<?
/* This is a comment */

cout << "Hello World"; // prints Hello World

/*
 * C++ comments can  also
 */

cout << "Hello World"; 

/* Comment out printing of Hello World:

cout << "Hello World"; // prints Hello World

*/

echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */

https://regex101.com/r/lx2f5F/1

It correctly matches lines 2, 4, 7~9, and 13~17.

But it also matches the // inside the single-quoted (') and double-quoted (") strings and inside the array near the end. How can I exclude those (non-greedy matching)?
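To make the problem concrete, here is a small sketch that runs the regular expression above against two of the problem lines. It uses Python's built-in re module rather than regex101 (an assumption made only for illustration), and the names OP_PATTERN and false_matches are hypothetical.

```python
import re

# The pattern from the question, unchanged.
OP_PATTERN = re.compile(
    r"(\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/)|(\/\/.*)"
)

# Two lines from the test string that contain no comment at all.
lines = [
    'echo "http://stackoverflow.com";',
    "array = ['//', 'no, you can not match this line!!']",
]

# Collect everything the pattern treats as a comment on those lines.
false_matches = [m.group(0) for line in lines for m in OP_PATTERN.finditer(line)]

for text in false_matches:
    print(repr(text))
```

Both lines produce a match that starts at the // inside the quotes, because the (\/\/.*) alternative has no notion of being inside a string.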

Any help would be greatly appreciated.

kkasp
  • What's the regex *for*? And *why*? – jonrsharpe Mar 14 '17 at 04:34
  • Clearly mention what or what lines you don't want to match ... if we have to understand it from your regex then your regex has to be correct which is not therefore there is no point investing time for guessing ... – Mustofa Rizwan Mar 14 '17 at 05:31
  • This question already has answer, You can have a look here! http://stackoverflow.com/a/41867753/2012407 – antoni Mar 14 '17 at 06:17
  • Possible duplicate of [match "//" comments with regex but not inside a quote](http://stackoverflow.com/questions/4568410/match-comments-with-regex-but-not-inside-a-quote) – VDWWD Mar 14 '17 at 07:53
  • To antoni and VDWWD: no, that does not solve the problem; it also matches the URL path. – kkasp Mar 15 '17 at 08:02

2 Answers


I believe I have a new best pattern for you.
/\/\*[\s\S]*?\*\/|(['"])[\s\S]+?\1(*SKIP)(*FAIL)|\/{2}.*/
This will accurately process the following block of text in just 683 steps:

<?
/* This is a comment */

cout << "Hello World"; // prints Hello World

/*
 * C++ comments can  also
 */

cout << "Hello World"; 

/* Comment out printing of Hello World:

cout << "Hello World"; // prints Hello World

*/

echo "//This line was not a Comment, but ... ";
echo "http://stackoverflow.com";
echo 'http://stackoverflow.com/you can not match this line';
array = ['//', 'no, you can not match this line!!']
/* This is * //a comment */

Pattern Explanation: (Demo: you can use the Substitution box at the bottom to replace the comment substrings with an empty string, effectively removing all comments.)

/\/\*[\s\S]*?\*\/ Match /* then 0 or more characters (as few as possible) then */
| OR
(['"])[\s\S]+?\1(*SKIP)(*FAIL) Match ' or " then 1 or more characters then the same (captured) quote character, then discard that match so nothing inside the quotes can be matched
| OR
\/{2}.*/ Match // then zero or more non-newline characters

[\s\S] behaves like . except that it also matches newline characters; it is used deliberately in the first two alternatives. The third alternative intentionally uses . so that it stops when a newline character is found.
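For readers without PCRE: Python's built-in re module does not support (*SKIP)(*FAIL), so the sketch below swaps in a different but commonly used technique. The quoted string is captured in a group and written back unchanged by a replacement callback, while comments are replaced with an empty string. The three alternatives are otherwise the same as in the pattern above; the names PATTERN and strip_comments are hypothetical.

```python
import re

# Alternative 1: block comment, lazily up to the first */
# Alternative 2: quoted string, captured in group 1 so it can be kept
# Alternative 3: line comment up to the end of the line
PATTERN = re.compile(r"/\*[\s\S]*?\*/|((['\"])[\s\S]+?\2)|//.*")


def strip_comments(source):
    """Remove /* */ and // comments, leaving quoted strings untouched."""

    def keep_strings(match):
        # Group 1 is set only when the quoted-string alternative matched.
        if match.group(1) is not None:
            return match.group(1)
        return ""

    return PATTERN.sub(keep_strings, source)


sample = (
    'cout << "Hello"; // prints\n'
    'echo "http://stackoverflow.com";\n'
    "/* a * //b */x"
)
print(strip_comments(sample))
```

Because the scan runs left to right, a quote is consumed as a string before any // inside it can be tried, which is the same effect (*SKIP)(*FAIL) has in the PCRE version.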

I have checked every sequence of alternatives, to ensure that the fastest alternatives come first and the pattern is optimized. My pattern correctly matches the OP's sample input. If anyone finds an issue with my pattern, please leave me a comment so that I can try to fix it.


Jan's pattern correctly matches all of the OP's desired substrings in 1006 steps using: ~([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)|(?|(?P<comment>(?s)\/\*.*?\*\/(?-s))|(?P<comment>\/\/.+))~gx

Sahil's pattern fails to completely match the final comment in your UPDATED sample input. This means either the question is wrong and should be closed as "unclear what you are asking", or Sahil's answer is wrong and it should not be awarded the green tick. When you updated your question, you should have requested that Sahil update his answer. When incorrect answers fail to satisfy the question, future SO readers are likely to become confused and SO becomes a less reliable resource.

mickmackusa

With PCRE you can use the (*SKIP)(*FAIL) mechanism:

([\'\"])(?<!\\).*?\1(*SKIP)(*FAIL)
|
(?|
    (?P<comment>(?s)/\*.*?\*/(?-s))
    |
    (?P<comment>//.+)
)

See a working demo on regex101.com.
Note: The branch reset (?|...) is not really needed here but was merely used to make clear the group called comment.
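As a rough illustration of the named comment group, the sketch below extracts comments in Python. It is an adaptation, not the pattern above: Python's re module has neither (*SKIP)(*FAIL) nor the branch reset (?|...), so both comment forms share a single named group and quoted strings are simply matched and ignored. The names PATTERN and find_comments are hypothetical.

```python
import re

# Quoted strings are matched first and ignored; only the named group
# "comment" is reported. [\s\S] stands in for the (?s) flag on the block
# comment, and .+ keeps the line comment on a single line.
PATTERN = re.compile(r"(['\"]).*?\1|(?P<comment>/\*[\s\S]*?\*/|//.+)")


def find_comments(source):
    """Return every comment in source, skipping anything inside quotes."""
    return [
        m.group("comment")
        for m in PATTERN.finditer(source)
        if m.group("comment") is not None
    ]


source = 'a = "//x"; // real\n/* multi\nline */\nb = \'/* no */\''
print(find_comments(source))
```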

Jan