I want to remove (Java/C/C++/..) multiline comments from a file. For this, I have written a regular expression:

/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/

This regular expression works well in Notepad++ and Geany (search and replace all with nothing), but it behaves differently in VB.NET.

I am using:

Microsoft Visual Studio 2010 (Version 10.0.40219.1 SP1Rel)
Microsoft .NET Framework (4.7.02053 SP1Rel)

The file I'm running replacements on is not that complex. I do not need to take care of any quoted text that might start or end comments.

@sln thank you for your detailed reply, I'll also quickly explain my regex as nicely as you did!

/\*                      Find the beginning of the comment.
[^\*]*                   Match any chars, but not an asterisk.
                         We need to deal with finding an asterisk now:
(\*+[^\*/][^\*]*)*       This regex breaks down to:
 \*+                     Consume asterisk(s).
    [^\*/]               Match any other char that is not an asterisk or a / (would end the comment!).
          [^\*]*         Match any other chars that are not asterisks.
(               )*       Try to find more asterisks followed by other chars.

\*+/                     Match 1 to n asterisks and finish the comment with /.
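
The breakdown above can also be written as a commented pattern. Here is a minimal sketch using Python's `re` module in verbose mode. Python and the name `BLOCK_COMMENT` are assumptions made for illustration only; the code in question is VB.NET, and this says nothing about how the .NET engine evaluates the pattern.

```python
import re

# The pattern from the question, spread out with re.VERBOSE so that each
# piece carries its explanation. BLOCK_COMMENT is an illustrative name.
BLOCK_COMMENT = re.compile(r"""
    /\*            # beginning of the comment
    [^\*]*         # any chars that are not an asterisk
    (              # repeated group:
        \*+        #   one or more asterisks
        [^\*/]     #   one char that is neither an asterisk nor a slash
        [^\*]*     #   any chars that are not an asterisk
    )*
    \*+/           # one or more asterisks, then the closing slash
""", re.VERBOSE)

sample = "a /* x */ b /** y **/ c"
stripped = BLOCK_COMMENT.sub("", sample)
print(repr(stripped))
```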

Here are two code snippets:

First:

text

/*
 * block comment
 *
 */ /* comment1 */ /* comment2 */

My text to keep.

/* more comments */

more text

Second:

text

/*
 * block comment
 *
 */ /* comment1 *//* comment2 */

My text to keep.

/* more comments */

more text

The only difference is the missing space between the two comments in the second snippet:

/* comment1 *//* comment2 */

Deleting found matches with Notepad++ and Geany works perfectly for both cases. Using regular expressions from VB.NET fails for the second example. The result for the second example after deletion looks like this:

text



more text

But it should look like this:

text



My text to keep.



more text

I am using System.Text.RegularExpressions:

Dim content As String = IO.File.ReadAllText(file_path_)
Dim multiline_comment_remover As Regex = New Regex("/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
content = multiline_comment_remover.Replace(content, "")

I would like to have the same results with VB.NET as with Notepad++ and Geany. As answered by sln, my regex "should work in a weird way". The question is why does VB.NET fail to process this regex as intended? This question is still open.

Since sln's answer got my code working, I'll accept it, although it doesn't explain why VB.NET doesn't like my regex. Thanks for all your help! I learned a lot!

  • VBA? Don't you mean VB.NET? – Mathieu Guindon Jul 16 '18 at 16:40
  • BTW make your pattern a *verbatim string*, `New Regex(@"...")` (works in C#, not sure about VB.NET)... or escape the backslashes? – Mathieu Guindon Jul 16 '18 at 16:42
  • hmm [nevermind, VB strings are always taken literally](https://stackoverflow.com/q/13155449/1188513) – Mathieu Guindon Jul 16 '18 at 16:54
  • @MathieuGuindon You are right. This is VB.NET. My bad. I'm using the IDE "Visual Basic 2010 Express" and I thought this is VBA ... – Roman Holler Jul 16 '18 at 19:03
  • VBA is edited in the *Visual Basic Editor* (VBE) a *hosted* IDE that lives in-process in the host app, e.g. Excel. You're using the *Visual Studio Express* IDE to write *Visual Basic 2010* code ;-) – Mathieu Guindon Jul 16 '18 at 19:04
  • In a weird way, your regex should work. However, note that what you're attempting is a modified _unrolled_loop_. The premise of this technique is to always end the loop with character(s) just before the closing delimiter. Example `/\*` _this section_-> `[^*]*\*+` <-_here_ ... <_start loop_> `(?:[^/*]` _and this section_-> `[^*]*\*+` <-_here_ `)*` <_end loop_> `/` –  Jul 17 '18 at 01:01
  • Just so you know, Java uses C++ style comments. It's not good enough to parse just `/*..*/`, you have to parse `//` as well. There are always options to not replace one type if need be, but you _cannot_ just parse a single type. Also, to complete the picture, you have to parse double quoted text at the same time, as it can hide comment syntax, and vice versa. If you say this is a one-off thing, then who cares. Just use this `/\*[^*]*\*+(?:[^/*][^*]*\*+)*/` –  Jul 17 '18 at 01:08
  • Thanks again! I'm aware of the `// single-line` comments. I'm using a different regex to delete them beforehand. Although, I didn't understand the unrolled_loop thingy fully yet. I need to dive deeper into this topic and read more about it. – Roman Holler Jul 23 '18 at 16:09
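
To compare the pattern from the question with the unrolled-loop form suggested in the comments, here is a small sketch in Python's `re` module. Python and all variable names are assumptions for illustration; the result describes Python's engine only and does not explain the .NET behaviour. Both patterns are run on the second snippet.

```python
import re

# Pattern from the question and the unrolled-loop variant from the comments.
original = re.compile(r"/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
unrolled = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

# The second snippet from the question (no space between comment1 and comment2).
second_snippet = (
    "text\n\n"
    "/*\n * block comment\n *\n */ /* comment1 *//* comment2 */\n\n"
    "My text to keep.\n\n"
    "/* more comments */\n\n"
    "more text\n"
)

result_original = original.sub("", second_snippet)
result_unrolled = unrolled.sub("", second_snippet)

# Both checks are expected to hold in Python's engine.
print(result_original == result_unrolled)
print("My text to keep." in result_unrolled)
```

In the unrolled form, each repetition of the loop ends on a run of asterisks, so the final `/` is only tried right after `\*+`.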

1 Answer

I think you could use a generalized C++ comment stripper.

Basically: globally find with the pattern below and replace with $2.

Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1

    # raw:   (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
    # delimited:  /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/

    (?m)                             # Multi-line modifier
    (                                # (1 start), Comments
         (?:
              (?: ^ [ \t]* )?                  # <- To preserve formatting
              (?:
                   /\*                              # Start /* .. */ comment
                   [^*]* \*+
                   (?: [^/*] [^*]* \*+ )*
                   /                                # End /* .. */ comment
                   (?:                              # <- To preserve formatting
                        [ \t]* \r? \n
                        (?=
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                   )?
                |
                   //                               # Start // comment
                   (?:                              # Possible line-continuation
                        [^\\]
                     |  \\
                        (?: \r? \n )?
                   )*?
                   (?:                              # End // comment
                        \r? \n
                        (?=                              # <- To preserve formatting
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                     |  (?= \r? \n )
                   )
              )
         )+                               # Grab multiple comment blocks if need be
    )                                # (1 end)

 |                                 ## OR

    (                                # (2 start), Non - comments
         # Quotes
         # ======================
         (?:                              # Quote and Non-Comment blocks
              "
              [^"\\]*                          # Double quoted text
              (?: \\ [\S\s] [^"\\]* )*
              "
           |                                 # --------------
              '
              [^'\\]*                          # Single quoted text
              (?: \\ [\S\s] [^'\\]* )*
              '
           |                                 # --------------

              (?:                              # Qualified Linebreak's
                   \r? \n
                   (?:
                        (?=                              # If comment ahead just stop
                             (?: ^ [ \t]* )?
                             (?: /\* | // )
                        )
                     |                                 # or,
                        [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                                         # or line continuation (escape + newline)
                   )
              )+
           |                                 # --------------
              [^/"'\\\r\n]+                    # Chars which don't start a comment, string, escape,
                                               # or line continuation (escape + newline)

         )+                               # Grab multiple instances

      |                                 # or,
         # ======================
         # Pass through

         [\S\s]                           # Any other char
         [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)

    )                                # (2 end), Non - comments

If you use a particular engine that doesn't support assertions,
then you'd have to use this.
This won't preserve formatting though.

Usage same as above.

    # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


    (                                # (1 start), Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )                                # (1 end)
 |  
    (                                # (2 start), Non - comments 
         "
         (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
         "
      |  '
         (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
         ' 
      |  [\S\s]                           # Any other char
         [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)
    )                                # (2 end)
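
As a usage sketch for this simpler pattern, assuming Python's `re` module (the names `STRIPPER` and `strip_comments` are made up for illustration): group 1 captures comments and group 2 captures everything else, so replacing each match with group 2 drops the comments while leaving quoted text alone.

```python
import re

# The assertion-free pattern from above, unchanged.
STRIPPER = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)

def strip_comments(source: str) -> str:
    # Keep group 2 (non-comments); a match of group 1 (a comment) becomes "".
    return STRIPPER.sub(lambda m: m.group(2) or "", source)

code = 'a = "/* not a comment */"; /* real */ b = 1; // tail\nc = 2;\n'
print(strip_comments(code))
```

The comment-like text inside the double quotes survives because the quoted string is consumed as a whole by group 2. The line break after `// tail` is removed together with the comment, which matches the note above that this version does not preserve formatting.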