26

I have a string, e.g.:

String src = "How are things today /* this is comment **/ and is your code  /** this is another comment */ working?";

I want to remove the /* this is comment **/ and /** this is another comment */ substrings from the src string.

I tried to use a regex, but failed due to my lack of experience.

Alan Moore
hanumant
  • 7
    Parsing Java code with regex is not something I'd recommend. – Confluence Oct 22 '12 at 15:49
  • @Confluence, I am not sure what could be the best option to achieve the result? Can you suggest one. – hanumant Oct 22 '12 at 15:52
  • What regex did you try? As you already say that you have tried something, you can as well just paste it here, so we can see your approach. We can go into more/less details about the solutions depending on your experience. – brimborium Oct 22 '12 at 15:52
  • `/\\*.*\\/` is what I used, and it removed the whole string after the first match. – hanumant Oct 22 '12 at 15:59
  • from https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch07s06.html, you can use either `/\*.*?\*/` or `/\*[\s\S]*?\*/` – psykid Jul 14 '21 at 13:00

8 Answers

56

The best multiline comment regex is an unrolled version of (?s)/\*.*?\*/ that looks like

String pat = "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/";

See the regex demo and explanation at regex101.com.

In short,

  • /\* - match the comment start /*
  • [^*]*\*+ - match 0+ characters other than * followed by 1+ literal *
  • (?:[^/*][^*]*\*+)* - 0+ sequences of:
    • [^/*][^*]*\*+ - one character that is not a / or * (matched with [^/*]) followed by 0+ non-asterisk characters ([^*]*) followed by 1+ asterisks (\*+)
  • / - closing /

David's regex needs 26 steps to find the match in my example string, while my regex needs just 12. With huge inputs, David's regex is likely to fail with a stack overflow or something similar, because the lazy dot pattern .*? is inefficient: the engine expands it one character at a time and re-checks the rest of the pattern at each position, while my pattern matches linear chunks of text in one go.
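As a minimal sketch of how the unrolled pattern could be used, the class below compiles it once and removes every block comment from a string. The class and method names (`UnrolledCommentDemo`, `stripBlockComments`) and the sample inputs are made up for illustration; only the pattern itself comes from this answer.

```java
import java.util.regex.Pattern;

class UnrolledCommentDemo {
    // The unrolled pattern from this answer, compiled once for reuse.
    static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Returns the input with every /* ... */ comment removed.
    // (Hypothetical helper name, not part of any library.)
    static String stripBlockComments(String text) {
        return BLOCK_COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String src = "How are things today /* this is comment **/ and is your code "
                + "/** this is another comment */ working?";
        System.out.println(stripBlockComments(src));

        // Comments made only of asterisks stop at the first closing */.
        System.out.println(stripBlockComments("a /**/ kept /***/ b"));

        // No DOTALL flag is needed: negated classes such as [^*] match newlines.
        System.out.println(stripBlockComments("x /* line 1\n * line 2 */ y"));
    }
}
```

Note that only the comment itself is removed; the spaces around it stay, so the result may contain doubled spaces.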

Wiktor Stribiżew
  • How did you come up with this? – JiaChen ZENG Oct 07 '17 at 14:15
  • 3
    @AT-Aoi It is basically taken from *Mastering Regular Expressions*, *Removing C Comments* section. – Wiktor Stribiżew Oct 07 '17 at 14:21
  • 1
    This has a bug, it incorrectly extends comments that consists only of asterisks past the closing `*/`. A small, syntactically correct C snippet demonstrates this issue: `/**/ Incorrectly removed /**/`. – jerry Jun 05 '19 at 20:37
  • 1
    @jerry I introduced a change some time ago, trying to accommodate for repeating asterisks at the start. Rolled back to the original version. Now, your issue is [not repro](https://regex101.com/r/dU5fO8/73). – Wiktor Stribiżew Jun 05 '19 at 20:43
  • 2
    An assumption like "because the .*? lazy dot matching is inefficient" cannot be made in general without referring to a specific regex engine and version. Even if it holds true for some engine, it may not hold true for another one and not even for a different version of the same one. It's not defined how a regex engine works; that's comparable to SQL not specifying how a database really works under the hood. – Mecki Mar 22 '20 at 02:17
  • Your ``my regex`` link still links to the old broken version that fails for /***/. – erg Aug 01 '20 at 15:00
  • awesome solution :) – Nav Jan 23 '21 at 18:48
  • Do you know how to fix the case where the comment is actually a part of a string? I posted it here: https://stackoverflow.com/q/66301705 – john c. j. Feb 21 '21 at 12:09
  • How about finding multiple line comments having a certain word like `foo`? – Hasanuzzaman Sattar Dec 14 '21 at 11:53
  • @HasanuzzamanSattar Using POSIX-like pattern here would be hard, [I'd suggest](https://regex101.com/r/dU5fO8/177) `(?s)/\*(?:(?!/\*|\*/).)*?foo(?:(?!/\*|\*/).)*\*/`, but note it is not going to be efficient. The best approach here is to match all of the comments and just filter out those containing some other pattern. – Wiktor Stribiżew Dec 14 '21 at 12:00
  • @WiktorStribiżew It's not working at Dreamweaver search tool. I want match all c/C++ style multiple lines comments where `foo` word is present inside the comment at least once. – Hasanuzzaman Sattar Dec 14 '21 at 12:09
  • 1
    @HasanuzzamanSattar Then it does not use Java / PCRE compliant regex engine. Probably, they use some kind of ECMAScript there, so you need `/\*(?:(?!/\*|\*/)[\w\W])*?foo(?:(?!/\*|\*/)[\w\W])*\*/`. – Wiktor Stribiżew Dec 14 '21 at 12:16
22

Try using this regex (Single line comments only):

String src ="How are things today /* this is comment */ and is your code /* this is another comment */ working?";
String result=src.replaceAll("/\\*.*?\\*/","");//single line comments
System.out.println(result);

REGEX explained:

Match the character "/" literally

Match the character "*" literally

"." Match any single character (except line terminators, unless (?s) is set)

"*?" Between zero and unlimited times, as few times as possible, expanding as needed (lazy)

Match the character "*" literally

Match the character "/" literally

Alternatively here is regex for single and multi-line comments by adding (?s):

// note the added \n, which the previous regex would not handle
String src ="How are things today /* this\n is comment */ and is your code /* this is another comment */ working?";
String result=src.replaceAll("(?s)/\\*.*?\\*/","");
System.out.println(result);
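To make the difference between the variants concrete, here is a small sketch that compares a greedy pattern, the lazy pattern, and the lazy pattern with DOTALL. It assumes that `Pattern.DOTALL` is used in place of the inline `(?s)` flag (the two are equivalent); the class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class LazyDotAllDemo {
    // Greedy: .* runs to the LAST */ in the input, swallowing text between comments.
    static String stripGreedy(String s) {
        return s.replaceAll("/\\*.*\\*/", "");
    }

    // Lazy: .*? stops at the first */, but '.' does not cross line breaks.
    static String stripLazy(String s) {
        return s.replaceAll("/\\*.*?\\*/", "");
    }

    // Lazy + DOTALL: same as prefixing the pattern with (?s).
    static String stripLazyDotAll(String s) {
        return Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL).matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "a /* one */ b /* two */ c";
        String multiLine = "a /* one\n two */ b";

        System.out.println(stripGreedy(oneLine));        // loses " b " as well
        System.out.println(stripLazy(oneLine));          // removes only the comments
        System.out.println(stripLazy(multiLine));        // unchanged: no match across \n
        System.out.println(stripLazyDotAll(multiLine));  // comment removed
    }
}
```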

Reference:

ThomasW
David Kroukamp
  • 9
    It will be fun when you have a string that contains the comment sequences. – nhahtdh Oct 22 '12 at 15:59
  • Can you explain how the multiline regex works? I don't understand the *(?s)* (and the link doesn't help in that regard). – brimborium Oct 22 '12 at 16:01
  • @nhahtdh +1 Lol, yes, that is another story completely. (I guess you just check the String beforehand for any conflicting series of characters, replace them with something safe until the regex is done, and then replace them back?) – David Kroukamp Oct 22 '12 at 16:01
  • 3
    @brimborium: `(?s)` is DOTALL, which means `.` will match even new line character (which it won't match by default). – nhahtdh Oct 22 '12 at 16:02
  • @DavidKroukamp: Something safe <-- quite tricky if you want it 100% safe. Before that, we have to recognize the bounds of the string first. – nhahtdh Oct 22 '12 at 16:04
  • @nhahtdh Thanks, gskinner didn't know about that one. ;) – brimborium Oct 22 '12 at 16:04
  • In PHP it won't work as "/\\*.*?\\*/", but it does as "/(\/\*.*?\*\/)+/". – Digerkam Mar 10 '18 at 21:57
4

Try this one:

(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)

If you want to exclude the parts enclosed in " ", then use:

(\"[^\"]*\"(?!\\))|(//[^\n]*$|/(?!\\)\*[\s\S]*?\*(?!\\)/)

The first capturing group matches all " " parts, and the second capturing group gives you the comments (both single-line and multi-line).

Copy the regular expression into regex101 if you want an explanation.

Akshay
1
(?s)(?i)(^|\s+?)(\/\*)((.)(?!\*\/))*?(this)(.*?)(\*\/)

This pattern is meant to find a word inside a comment (here, the word this).
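A comment under another answer in this thread suggests a different route: match all comments first, then keep only those containing the word. A rough sketch of that idea follows; it reuses the unrolled block-comment pattern from that answer, does a plain case-insensitive substring check rather than a whole-word match, and uses names (`CommentWordFilter`, `commentsContaining`) that are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentWordFilter {
    // Unrolled /* ... */ pattern taken from another answer in this thread.
    static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/");

    // Collects every block comment whose text contains the given word
    // (case-insensitive substring check; hypothetical helper).
    static List<String> commentsContaining(String text, String word) {
        List<String> hits = new ArrayList<>();
        String needle = word.toLowerCase(Locale.ROOT);
        Matcher m = BLOCK_COMMENT.matcher(text);
        while (m.find()) {
            String comment = m.group();
            if (comment.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(comment);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String src = "a /* this one */ b /* other */ c /* THIS too */";
        System.out.println(commentsContaining(src, "this"));
    }
}
```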

Maneskin
0

You can't parse C/C++ style comments in Java source on their own.
Quoted strings have to be parsed at the same time, and within the same regex,
because a string may embed /* or //, which looks like the start of a comment
when it is just part of the string.

Note that additional regex considerations are needed if raw string constructs
are possible in the language.

The regex that does this is shown below.
Group 1 contains the comment and group 2 contains the non-comment.
For example, to remove comments, it would be:

Find
(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)

Replace
$2


Stringed:
"(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n|$))|(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|[\\S\\s][^/\"'\\\\]*)"
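For illustration, this is how the find/replace pair above might be applied from Java. The sketch splits the stringed regex across two literals at the `|` between the two groups and replaces each match with `$2`; in Java, a reference to a group that did not participate in the match inserts nothing, so comments simply vanish. The class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

class CommentAndStringAwareStripper {
    // Group 1: block or line comment. Group 2: string literal, char literal,
    // or a run of ordinary code. Same regex as above, split in two for readability.
    static final Pattern COMMENT_OR_CODE = Pattern.compile(
            "(/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n|$))"
          + "|(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|[\\S\\s][^/\"'\\\\]*)");

    // Keeps group 2 (non-comment text) and drops group 1 (comments).
    static String stripComments(String source) {
        return COMMENT_OR_CODE.matcher(source).replaceAll("$2");
    }

    public static void main(String[] args) {
        String code = "String s = \"/* not a comment */\"; /* real */ int x; // tail";
        // The quoted text survives; both real comments are removed.
        System.out.println(stripComments(code));
    }
}
```

One thing to be aware of: the `//` branch also consumes the line break that ends the comment, so the following line is joined to the current one.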

-1
System.out.println(src.replaceAll("\\/\\*.*?\\*\\/ ?", ""));

You have to make the quantifier non-greedy (.*? instead of .*) to get the regex working. I also added a ' ?' at the end of the regex to remove one trailing space.

jens-na
-1

Try this which worked for me:

System.out.println(src.replaceAll("(/\\*.*?\\*/)+", "")); // \/ and \* are not valid escapes in a Java string literal
Digerkam
-1

This could be the best approach for multi-line comments:

System.out.println(text.replaceAll("\\/\\*[\\s\\S]*?\\*\\/", ""));

Mahesh Yadav