I am working on a personal project where I need to extract the text of block comments from input strings like the following.

Case 1: /* Some useful text */

Output: Some useful text

Case 2: /*** This is formatted obnoxiously**/

Output: This is formatted obnoxiously

Case 3:

    /**

    More useful
information

    */

Output: More useful information

Case 4:

/**
Prompt the user to type in 
the number. Assign the number to v
*/

Output: Prompt the user to type in the number. Assign the number to v

I am working in Java. I tried removing /* and */ with a naive approach such as String.replace, but because a comment can be formatted in different ways, as shown above, that does not seem viable. How can I produce the outputs above using a regex?

Here is the test comment file that I am using.

hyde

2 Answers

Try something like:

"/\\*+\\s*(.*?)\\*+/"

The dot should also match newlines, so compile the pattern with DOTALL:

Pattern p = Pattern.compile("/\\*+\\s*(.*?)\\*+/", Pattern.DOTALL);

EDIT

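// Requires java.util.regex.Pattern and java.util.regex.Matcher
Pattern p = Pattern.compile("/\\*+\\s*(.*?)\\*+/", Pattern.DOTALL);
Matcher m = p.matcher("/*** This is formatted obnoxiously**/");
// group(1) is only valid after find() has returned true
if (m.find()) {
    String sanitizedComment = m.group(1);
    System.out.println(sanitizedComment);
}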
Stephan
  • `.` doesn't match new-lines in Java (not by default anyway, not sure if there's a way to set that). You need `(.|\n)` – Bernhard Barker Apr 17 '13 at 07:28
  • @Dukeling: There is a way to set it in Java (DOTALL option). It is not a good idea to write `(.|\n)`, since you might miss out some characters. `.` excludes more than just `\n` in Java. – nhahtdh Apr 17 '13 at 07:34
  • @Dukeling nhahtdh is right, I've updated my answer to show how you can make the dot match newlines – Stephan Apr 17 '13 at 08:00
  • @Stephan that didn't work. I got `IllegalStateException` because there were no matches. `Pattern p = Pattern.compile("/\\*+\\s*(.*?)\\*+/", Pattern.DOTALL); Matcher m = p.matcher(matchedComment); String sanitizedComment = m.group(); System.out.println(sanitizedComment);` – hyde Apr 17 '13 at 08:07
  • @Stephan, I did as you said, and everything works before the highlighted file in the code that I have uploaded [here](http://pastie.org/private/5hjraugtci52u43rvcuw#29-31) – hyde Apr 17 '13 at 08:42
  • @NullGeo i do not get it... for me it works : String test = "/** "+System.getProperty("line.separator") +" Estimate the square root and assign it to x_0 "+System.getProperty("line.separator") +" */"; Pattern p = Pattern.compile("/\\*+\\s*(.*?)\\*+/", Pattern.DOTALL); Matcher m = p.matcher(test); boolean a = m.find(); String sanitizedComment = m.group(1); System.out.println(sanitizedComment); – Stephan Apr 17 '13 at 08:52
  • @Stephan But you're using strings without newlines in your test cases. Try it with newlines. [Here](http://pastie.org/private/gb4genjs0lzbzjgttkdrg#23,41) is the comment file I am using – hyde Apr 17 '13 at 08:55
  • @NullGeo i am using new lines : System.getProperty("line.separator") output the string if you don't believe me – Stephan Apr 17 '13 at 09:05
  • @Stephan, sorry about that. In fact it was working like you said. The problem is that my other regex for finding comments is giving me wrong values. I used the regex from this page http://stackoverflow.com/questions/1657066/java-regular-expression-finding-comments-in-code to find block comments, but for the test file it's outputting `"Find the square root of " "%lf" ` Highlighted lines are being taken as comments: http://pastie.org/private/5hjraugtci52u43rvcuw#26,40 Any better way to get C-style block comments using regex? – hyde Apr 17 '13 at 09:18
  • @NullGeo if the regex from that question didn't work out for you did you try using the one provided by me or Keppil ? maybe they will have better results – Stephan Apr 17 '13 at 09:39
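
The points raised in this thread (compile with DOTALL, call find() before group(), and tidy the captured text) can be combined into one small helper. The following is only a sketch under assumptions: the class name BlockCommentExtractor and the method extractAll are hypothetical and do not come from the answer, and the whitespace collapsing step is borrowed from the normalization idea rather than from this answer's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class BlockCommentExtractor {
    // DOTALL lets '.' match line terminators, so multi-line comments match.
    private static final Pattern BLOCK_COMMENT =
            Pattern.compile("/\\*+\\s*(.*?)\\*+/", Pattern.DOTALL);

    // Returns the text of every /* ... */ comment found in the input.
    static List<String> extractAll(String source) {
        List<String> comments = new ArrayList<>();
        Matcher m = BLOCK_COMMENT.matcher(source);
        // group() may only be called after find() has succeeded;
        // calling it first throws IllegalStateException.
        while (m.find()) {
            // Collapse whitespace runs (including line breaks) to one space
            // and drop the space left in front of the closing delimiter.
            comments.add(m.group(1).replaceAll("\\s+", " ").trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        String input = "/*** This is formatted obnoxiously**/\n"
                + "int v;\n"
                + "/**\nPrompt the user to type in \nthe number. Assign the number to v\n*/";
        for (String comment : extractAll(input)) {
            System.out.println(comment);
        }
        // Prints:
        // This is formatted obnoxiously
        // Prompt the user to type in the number. Assign the number to v
    }
}
```

Note that this pattern knows nothing about string literals or // line comments, so a /* that appears inside a quoted string would still be treated as the start of a comment.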

You can use the following regex:

String newString = oldString.replaceAll("/\\*+\\s*|\\s*\\*+/", "");

EDIT

To also get rid of newlines you could do something like:

String regex = "/\\*+\\s*|\\s*\\*+/|[\r\n]+";
String newString = oldString.replaceAll(regex, "");
Keppil
  • Awesome, It worked. Thanks! Now I have one more question, I am using the following escaped string for finding comments in the file. `//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/` How can I make it so that it will only find /* ... */ comments and not single line comments ( // ... ) ? – hyde Apr 17 '13 at 07:28
  • Hmm, looks like it does not work for cases like this: `/** Prompt the user to type in the number. Assign the number to v */` The comment box is not letting me enter blank lines, so I will update the question. – hyde Apr 17 '13 at 07:58
  • @NullGeo: To get rid of the newlines I would just add a `.replaceAll(System.getProperty("line.separator"), "")` – Keppil Apr 17 '13 at 08:33
  • @Keppil: You should make a second pass to remove the line separators. And don't just remove them; you may end up running words together. What you want to do is normalize the remaining whitespace (e.g. `.replaceAll("\\s+", " ");`). As for the `line.separator` property, see [this answer](http://stackoverflow.com/a/247597/20938) for a discussion of its disutility. – Alan Moore Apr 17 '13 at 09:24
  • @AlanMoore: Sure, if line feeds need to be replaced by a space, then a second pass is needed. There might be other small adjustments that need to be made depending on OPs use case, but I think it is fairly trivial to tweak the code above to fit such extra demands though. – Keppil Apr 17 '13 at 10:06
  • No worries. My biggest objection is to your use of the `line.separator` property. There's no reason to expect all line separators in a given file to be the same as the current platform's default (or more accurately, what Java *thinks* is the platform default). If you're on a Windows machine and the file contains only Unix-style separators, you'll be trying to match `\n` with `\r\n`, so nothing will happen. Other way 'round, you'll remove all the `\n`s and leave the `\r`s in place. – Alan Moore Apr 17 '13 at 13:55
  • @AlanMoore: Fair enough, changed to a more general approach. – Keppil Apr 17 '13 at 13:58
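
The two-pass idea suggested in the comments (strip the delimiters first, then normalize the remaining whitespace instead of deleting line breaks) might look like the sketch below. The class name CommentDelimiterStripper and the method strip are hypothetical, and the input is assumed to be a single comment that has already been isolated from the surrounding source.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class CommentDelimiterStripper {
    // Assumes the input is one already-isolated block comment.
    static String strip(String comment) {
        return comment
                // Pass 1: remove the opening and closing delimiters together
                // with the whitespace that touches them.
                .replaceAll("/\\*+\\s*|\\s*\\*+/", "")
                // Pass 2: turn every remaining whitespace run, including
                // \n and \r\n, into one space so words do not run together.
                .replaceAll("\\s+", " ")
                // Drop indentation that preceded the opening delimiter.
                .trim();
    }

    public static void main(String[] args) {
        String comment = "    /**\n\n    More useful\ninformation\n\n    */";
        System.out.println(strip(comment));
        // Prints: More useful information
    }
}
```

Deleting line breaks outright, as in the `[\r\n]+` alternative, would turn the input above into "More usefulinformation", which is why the second pass replaces whitespace with a single space. Because `\s` matches both `\r` and `\n`, the sketch does not depend on the platform's line separator.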