Remove comments from source code in Java

Question

I want to remove all types of comments from a Java source code file. Example:

    String str1 = "SUM 10"      /*This is a Comments */ ;   
    String str2 = "SUM 10";     //This is a Comments"  
    String str3 = "http://google.com";   /*This is a Comments*/
    String str4 = "('file:///xghsghsh.html/')";  //Comments
    String str5 = "{\"temperature\": {\"type\"}}";  //comments

Expected Output:

    String str1 = "SUM 10"; 
    String str2 = "SUM 10";  
    String str3 = "http://google.com";
    String str4 = "('file:///xghsghsh.html/')";
    String str5 = "{\"temperature\": {\"type\"}}";

I am using the following regular expression to achieve this:

    System.out.println(str1.replaceAll("[^:]//.*|/\\\\*((?!=*/)(?s:.))+\\\\*/", ""));

This gives me the wrong result for str4 and str5. Please help me resolve this issue.

Using Andreas's solution:

        final String regex = "//.*|/\\*(?s:.*?)\\*/|(\"(?:(?<!\\\\)(?:\\\\\\\\)*\\\\\"|[^\\r\\n\"])*\")";
        final String string = "    String str1 = \"SUM 10\"      /*This is a Comments */ ;   \n"
             + "    String str2 = \"SUM 10\";     //This is a Comments\"  \n"
             + "    String str3 = \"http://google.com\";   /*This is a Comments*/\n"
             + "    String str4 = \"('file:///xghsghsh.html/')\";  //Comments\n"
             + "    String str5 = \"{\"temperature\": {\"type\"}}";  //comments";
        final String subst = "$1";

        // The substituted value will be contained in the result variable
        final String result = string.replaceAll(regex,subst);

        System.out.println("Substitution result: " + result);

It works except for str5.

You’ll never be able to cover all cases with a regular expression. You need to parse the code with a proper parser. — bfontaine, Jul 05 '19 at 16:31
Unless you have a priori knowledge of what kind of source code you will encounter, anything short of a full-fledged parser is a risky approach. There is a github project in java that claims to do the job: https://github.com/ertugrulcetin/CommentRemover — collapsar, Jul 05 '19 at 17:07

Andreas · Accepted Answer · 2019-07-05T17:50:43.030

To make it work, you need to "skip" string literals. You can do that by matching string literals, capturing them so they can be retained.

The following regex will do that, using $1 as the substitution string:

//.*|/\*(?s:.*?)\*/|("(?:(?<!\\)(?:\\\\)*\\"|[^\r\n"])*")

See regex101 for demo.

Java code is then:

str1.replaceAll("//.*|/\\*(?s:.*?)\\*/|(\"(?:(?<!\\\\)(?:\\\\\\\\)*\\\\\"|[^\r\n\"])*\")", "$1")

Explanation

//.*                      Match // and rest of line
|                        or
/\*(?s:.*?)\*/            Match /* and */, with any characters in-between, incl. linebreaks
|                        or
("                        Start capture group and match "
  (?:                      Start repeating group:
     (?<!\\)(?:\\\\)*\\"     Match escaped " optionally prefixed by escaped \'s
     |                      or
     [^\r\n"]                Match any character except " and linebreak
  )*                       End of repeating group
")                        Match terminating ", and end of capture group

$1                        Keep captured string literal
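As a rough, self-contained check of the regex above, here is a sketch that runs it over the five lines from the question plus one multi-line comment. The class name `RegexCommentStripper` and the method `strip` are made up for this illustration. Note the escaping of str5 in the test input: to get `\"` into the input text, the Java literal has to contain `\\\"`. If only `\"` is written, the Java compiler ends the test string early and the regex never sees the intended line.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; the name is not from the answer.
final class RegexCommentStripper {

    // Same pattern as above: comments match without a capture group,
    // string literals are captured in group 1.
    private static final Pattern COMMENT_OR_STRING = Pattern.compile(
            "//.*|/\\*(?s:.*?)\\*/|(\"(?:(?<!\\\\)(?:\\\\\\\\)*\\\\\"|[^\r\n\"])*\")");

    static String strip(String source) {
        // "$1" puts a matched string literal back unchanged; for a comment match
        // group 1 did not participate, so nothing is inserted.
        return COMMENT_OR_STRING.matcher(source).replaceAll("$1");
    }

    public static void main(String[] args) {
        String source =
              "    String str1 = \"SUM 10\"      /*This is a Comments */ ;   \n"
            + "    String str2 = \"SUM 10\";     //This is a Comments\"  \n"
            + "    String str3 = \"http://google.com\";   /*This is a Comments*/\n"
            + "    String str4 = \"('file:///xghsghsh.html/')\";  //Comments\n"
            // \\\" in this literal produces \" in the input, as in the original source line
            + "    String str5 = \"{\\\"temperature\\\": {\\\"type\\\"}}\";  //comments\n"
            + "    int x = 1; /* spans\n       two lines */ int y = 2;\n";
        System.out.println(strip(source));
    }
}
```

The whitespace around removed comments is left in place; trimming it would need a separate step.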

It works except last string. final String regex = "//.*|/\\*(?s:.*?)\\*/|(\"(?:(?<!\\\\)(?:\\\\\\\\)*\\\\\"|[^\\r\\n\"])*\")"; final String string = " String str1 = \"SUM 10\" /*This is a Comments */ ; \n" + " String str2 = \"SUM 10\"; //This is a Comments\" \n" + " String str4 = \"('file:///xghsghsh.html/')\"; //Comments\n" + " String str5 = \"{\"temperature\": {\"type\"}}"; //comments"; final String subst = "$1"; final String result = string.replaceAll(regex,subst); System.out.println("Substitution result: " + result); — Anand, Jul 07 '19 at 12:01

score 0 · Answer 2 · answered Jul 05 '19 at 17:30

{...wishing I could comment...}

I recommend a two-pass process: one pass for end-of-line comments (//), the other for block comments (/* */).

I like Pavel's idea; however, I don't see how it checks to make sure the star is the next character after a slash and vice versa on closing out.

I like Andreas' idea; however, I wasn't able to get it to work on multi-line comments.

https://docs.oracle.com/javase/specs/jls/se12/html/jls-3.html#jls-CommentTail

score -1 · Answer 3 · edited Jun 20 '20 at 09:12

It might be best to start with several simple expressions, applied step by step, such as:

(.*)(\s*\/\*.*|\s*\/\/.*)

to initially remove the inline comments.

Demo

Test

import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "(.*)(\\s*\\/\\*.*|\\s*\\/\\/.*)";
final String string = "    String str1 = \"SUM 10\"      /*This is a Comments */ ;   \n"
     + "    String str2 = \"SUM 10\";     //This is a Comments\"  \n"
     + "    String str3 = \"http://google.com\";   /*This is a Comments*/\n"
     + "    String str4 = \"('file:///xghsghsh.html/')\";  //Comments\n"
     + "    String str5 = \"{\\\"temperature\\\": {\\\"type\\\"}}\";  //comments";
final String subst = "$1"; // Java references groups as $1; "\\1" would insert a literal 1

final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);

// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);

System.out.println("Substitution result: " + result);

The problem is when a string literal contains what looks like a comment, e.g. how `str4` contains `//`. If you remove the actual comment from that line, your regex will treat the `//` in the literal as a comment. — Andreas, Jul 05 '19 at 17:00

score -1 · Answer 4 · answered Jul 05 '19 at 16:59

As others said, regex is not a good option here. You could use a simple DFA for this task.
Here's an example that finds the intervals of multi-line comments (/* */).
Single-line comments (from // up to \n) can be handled the same way.

    String input = ...; // here's your input String

    // 0 - source code
    // 1 - saw '/', possible start of a multi-line comment
    // 2 - inside a multi-line comment (after "/*")
    // 3 - saw '*' inside the comment, possible end
    byte state = 0;
    int startPos = -1;
    int endPos = -1;
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        switch (state) {
        case 0:
            if (c == '/') {
                state = 1;
                startPos = i;
            }
            break;
        case 1:
            // only "/*" opens a comment; anything else is ordinary source code
            state = (c == '*') ? 2 : 0;
            break;
        case 2:
            if (c == '*') {
                state = 3;
            }
            break;
        case 3:
            if (c == '/') {
                state = 0;
                endPos = i + 1;

                // here you have the comment between startPos and endPos indices,
                // you can do whatever you want with it
            } else if (c != '*') {
                state = 2; // the '*' was part of the comment body
            }
            break;
        default:
            break;
        }
    }
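The same idea can be extended to single-line comments and to string and character literals, which the example above does not track. The following is only a sketch under assumptions, not the answer author's code: the class name `CommentStateMachine`, the method `strip`, and the state names are invented for illustration, and it ignores text blocks (`"""`) and Unicode escapes.

```java
// Hypothetical helper class; names are illustrative only.
final class CommentStateMachine {

    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    // Returns src with // and /* */ comments removed; literals are copied untouched.
    static String strip(String src) {
        StringBuilder out = new StringBuilder(src.length());
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
            case CODE:
                if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
                if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
                if (c == '"') state = State.STRING;
                else if (c == '\'') state = State.CHAR;
                out.append(c);
                break;
            case STRING:
            case CHAR:
                out.append(c);
                if (c == '\\' && i + 1 < src.length()) {
                    out.append(next); // copy the escaped character verbatim
                    i += 2;
                    continue;
                }
                // closing quote, or an unterminated literal at the end of the line
                if ((state == State.STRING && c == '"')
                        || (state == State.CHAR && c == '\'')
                        || c == '\n') {
                    state = State.CODE;
                }
                break;
            case LINE_COMMENT:
                // the line terminator ends the comment and is kept in the output
                if (c == '\n' || c == '\r') { state = State.CODE; out.append(c); }
                break;
            case BLOCK_COMMENT:
                if (c == '*' && next == '/') { state = State.CODE; i += 2; continue; }
                break;
            }
            i++;
        }
        return out.toString();
    }
}
```

For example, `CommentStateMachine.strip("String blah = \"/* surprise */\"; // note")` leaves the string literal intact and drops only `// note`. A block comment is replaced by nothing here, so `a/**/b` becomes `ab`; appending a single space instead would keep the tokens apart.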

Pavel Smirnov

DFAs are even less expressive than regexen, so how would your approach work ? – collapsar Jul 05 '19 at 17:02
@collapsar, what do you mean "less expressive"? They can give exactly what OP wants. Lexers, parsers are all based on DFA or NFA. – Pavel Smirnov Jul 05 '19 at 17:05
D/NFAs can only detect regular languages. Which is less than common regex engines handle (eg. they allow recognizing correctly bracketed expressions). Programming languages and Java in particular are not regular but context-free (disregarding type checking). In particular, Programming language parsers are _not_ based on D/NFAs. To reliably identify comments in Java sources you need to parse the source code (though you may get away with D/NFAs 'most of the time') – collapsar Jul 05 '19 at 17:16
Does your DFA correctly handle `/* String blah = "*/"; */` ? – collapsar Jul 05 '19 at 17:19
@collapsar, "D/NFAs can only detect regular languages" - not really. They're used to split input text into tokens and process them. Some parser generators like ANTLR or javaCC use a DFA. `/* String blah = "*/"; */` - does your java compiler handle this? :) It's an invalid comment. – Pavel Smirnov Jul 05 '19 at 17:24
`"D/NFAs can only detect regular languages" - not really.` . Oh yes, that is exactly what they do. Consult any elementary textbook on formal languages to see that. `It's an invalid comment` - oh yes, you are right on this one - no nested comments in Java ... – collapsar Jul 05 '19 at 17:32
`They're used to split input text into tokens and process them.`. – collapsar Jul 05 '19 at 17:34
`They're used to split input text into tokens and process them.` - close. 'Processing them' is exactly what goes beyond the capabilities of D/NFAs only (which, strictly speaking, cannot process anything). The D/NFA part does nothing but segmenting the input into comments. EOL comments can be handled this way, multiline comments: nope, if they can be nested. You have a point wrt Java, as nesting is not permitted (which I forgot). However, regexen will do fine. Btw. Does your code handle `String blah = "/* surprise */";`correctly ? ;) – collapsar Jul 05 '19 at 17:41
@collapsar, regular languages can be parsed by DFA - that is the job of a lexer. Context-free grammars can be parsed by a parser which is a **DFA + a stack**, but both are **based** on DFA. [Check this, for instance](https://stackoverflow.com/questions/2842809/lexers-vs-parsers) `String blah = "/* surprise */";` - that is not even a comment, but the value of the String. – Pavel Smirnov Jul 05 '19 at 17:48
`both are based on DFA.` - imho this is a very strange reasoning, skipping over an _essential_ part of what a parser is based upon. However, let's move this thread to chat if you are interested to follow up, it is not directly related to the question. `that is not even a comment, but the value of the String.` - precisely, but since your dfa does not account for string literals, it will delete the string content. – collapsar Jul 05 '19 at 17:54
@collapsar, it's just an example I wrote to show the general idea. Double-quoted literals and single-quoted literals can be implemented the same way. – Pavel Smirnov Jul 05 '19 at 17:58
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196059/discussion-between-collapsar-and-pavel-smirnov). – collapsar Jul 05 '19 at 18:08