I want to remove comments from Java code. I have seen a lot of examples, but every one of them gets some case wrong.

Here is an example of code that those solutions break:

    String somestring = "http://google.com"; // "//google.com";" is going to be removed

Another example:

    get.setHeader("Accept", "*/*"); // "/*");" and later is going to be removed too

What I want is a correct regular expression that handles those cases.

I tried http://ostermiller.org/findcomment.html, "Regular expression to remove comment", and other popular examples.

It should handle the common cases:

    somemethod();//it should be removed
    somemethod(); /* some comment that may end on other line */

But it should also handle these situations:

    String somestring = "http://google.com"; // url shouldn't be touched
    get.setHeader("Accept", "*/*"); // "*/*" shouldn't be touched too
BuGiZ400
  • 1
    What's your expected output? – Avinash Raj Feb 09 '15 at 13:38
  • Read the bottom paragraph of the link you provided: *The solution to this is to write regular expressions that describe each of the possible larger elements, find these as well, decide what type of element each is, and discard the ones that are not comments. There are tools called lexers or tokenizers that can help with this task.* – aioobe Feb 09 '15 at 13:42
  • You're going to need **[this approach](http://stackoverflow.com/questions/25402109/regex-for-comments-in-strings-strings-in-comments-etc)**. Java is very similar to JavaScript so I imagine some slight tweaks to my answer there could do the trick. – asontu Feb 09 '15 at 13:42
  • A regex seems hard here because you have to count `"`, `\n`, `/*` characters – Arnaud Denoyelle Feb 09 '15 at 13:42
  • Why do you need a regex at all? It might fail at unexpected occurrences. How about using a parser for Java syntax and using the AST to find comments? – SpaceTrucker Feb 09 '15 at 13:45
  • 1
    Regex is not the best tool for this. We would have to check that `//` doesn't occur inside a string, which is not trivial: counting `"` is tricky because some of them may be `"` literals escaped as `\"`. But we also can't assume that every `"` preceded by ``\`` is a literal, because the string could be `"\\"`, in which case the last `"` is not a `"` literal but the proper end of a string representing the ``\`` character. What you need is a parser. – Pshemo Feb 09 '15 at 13:45
  • If you can run Node.js, then [decomment](https://github.com/vitaly-t/decomment) can do what you want. – vitaly-t Mar 12 '16 at 21:28
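
Several of the comments above recommend a lexer or parser instead of a regex. The following is only a minimal sketch of that idea: a hand-written scanner that tracks whether it is in code, a string literal, a char literal, or a comment. The names `CommentStripper` and `strip` are made up for illustration. It assumes the source does not write quotes or backslashes as `\uXXXX` escapes and does not use Java text blocks (`"""`); a real Java parser would be needed to cover those.

```java
final class CommentStripper {
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    /** Returns src with // and block comments removed; literals are copied verbatim. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++; // skip the second '/'
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++; // skip the '*'
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        out.append(next); // copy the escaped character as-is
                        i++;
                    } else if ((state == State.STRING && c == '"')
                            || (state == State.CHAR && c == '\'')) {
                        state = State.CODE; // closing quote reached
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') { // keep the newline so line structure survives
                        out.append(c);
                        state = State.CODE;
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i++; // skip the closing '/'
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("String somestring = \"http://google.com\"; // url"));
        System.out.println(strip("get.setHeader(\"Accept\", \"*/*\"); /* block */"));
    }
}
```

Because escapes are consumed in pairs inside literals, a string such as `"\\"` ends at the right quote, which is the case Pshemo's comment warns about.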

1 Answer

I already commented on this, but let's see how far we get. Java doesn't have regex literals, so stripping that part from my answer linked in the comments above, we get the following regex:

    ((['"])(?:(?!\2|\\).|\\.)*\2)|\/\/[^\n]*|\/\*(?:[^*]|\*(?!\/))*\*\/

Debuggex Demo

If we then "replace" every match with the first capture group, every match that doesn't have a capture group to begin with (i.e. a comment) is removed:

Regex101 substitution Demo

An explanation of the more generic "match this except in conditions a|b|c" technique I employed is available here.
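
To make the "replace every match with the first capture group" step concrete in Java, here is a rough sketch. The class name `RegexCommentRemover`, the method `removeComments`, and the constant `LITERAL_OR_COMMENT` are hypothetical; only the pattern itself comes from the answer, with the `\/` escapes dropped because Java does not need them. In `Matcher.replaceAll`, a `$1` that refers to a group that did not participate in the match contributes nothing, so comments vanish while quoted literals are written back unchanged.

```java
import java.util.regex.Pattern;

final class RegexCommentRemover {
    // Group 1 is set only when a string or char literal matched;
    // for the two comment alternatives it stays unset.
    private static final Pattern LITERAL_OR_COMMENT = Pattern.compile(
            "((['\"])(?:(?!\\2|\\\\).|\\\\.)*\\2)" // quoted literal, honoring backslash escapes
            + "|//[^\\n]*"                          // line comment up to end of line
            + "|/\\*(?:[^*]|\\*(?!/))*\\*/");       // block comment, may span lines

    static String removeComments(String source) {
        // "$1" puts literals back and replaces comments with the empty string.
        return LITERAL_OR_COMMENT.matcher(source).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeComments(
                "String somestring = \"http://google.com\"; // url shouldn't be touched"));
        System.out.println(removeComments(
                "get.setHeader(\"Accept\", \"*/*\"); // \"*/*\" shouldn't be touched too"));
        System.out.println(removeComments(
                "somemethod(); /* some comment that may end\n on other line */"));
    }
}
```

Two caveats on this sketch: it inherits the `\uXXXX` limitation discussed in the comments below, and repeated groups containing alternation can make Java's regex engine throw `StackOverflowError` on very long literals or comments, so it is probably best suited to modest inputs.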

asontu
  • 1
    Nice solution, +1 for effort, but in Java some characters can also be written as Unicode escapes: ``\`` can be written as `\u005C` and will be treated as ``\``, making a string like `"foo\u005C"bar"` valid and equal to `"foo\"bar"`. Because of this your regex can fail: https://regex101.com/r/vI2iW5/2 – Pshemo Feb 09 '15 at 14:27
  • 1
    Woah, to my thorough surprise, you are actually right: https://ideone.com/wr9x1W So yes my regex requires you to be sane enough to not write Java code/control characters in `\uXXXX` syntax **o.O** – asontu Feb 09 '15 at 15:04
  • 1
    So your answer assumes sanity of programmer... That is crazy! But yes, with this assumption your answer makes sense. – Pshemo Feb 09 '15 at 17:22
  • That also removes strings in annotations (`@Something("foobar")`) – itsTyrion Apr 05 '23 at 21:15
  • @itsTyrion It doesn't, unless you remove every instance of the matched regex. What you should do instead is _**replace**_ the match with the _**first capture group**_ of the match (as mentioned in the answer). That way anything in double quotes (including annotations) will be put back where it was found, unaltered. – asontu May 30 '23 at 12:06