
(For anyone who runs into the same issue: note that this problem may be specific to .NET and C#. See Wiktor's answer below.)

Before asking this question, I read many related questions (including this one: Match linebreaks - \n or \r\n?), but none of the answers worked for me.

In my case, I want to remove all // comments from some code files. To handle files from Mac, Unix, and Windows, I need something that matches the text between // and \r, \n, or \r\n.

Here is the test content of code file:

        var text = "int rn = 0; //comment1.0\r\n" +
                   "int r = 0; //comment2.\r" + 
                   "int n = 0; //comment3.\n" + 
                   "end";
        var txt = text.RemoveLineEndComment();

And here is the regex (if you are not a C# programmer, just focus on the regex, please):

using System.Text.RegularExpressions;

public static class CommentRemover
{
    // Intended to match from "//" to the end of the line ($ with RegexOptions.Multiline).
    private static readonly Regex RegexRemoveLineEndComment =
        new(@"\/\/.*$", RegexOptions.Multiline);

    // Removes every match of the pattern above from the text.
    public static string RemoveLineEndComment(this string text)
    {
        return RegexRemoveLineEndComment.Replace(text, string.Empty);
    }
}

What I need is txt = "int rn = 0; \r\nint r = 0; \rint n = 0; \nend". Here are the regexes I tried and the corresponding results:

//.*$ => txt="int rn = 0; \nint r = 0; \nend" (int n = 0 is missing)

//.*(?=\r\n) => txt="int rn = 0; \r\nint r = 0; //comment2.\rint n = 0; //comment3.\nend" (comment2 and 3 are left)

//.*(?=\r?\n?) => txt="int rn = 0; \nint r = 0; \nend" (int n = 0 is missing)

//.*(?=(\r\n|\r|\n)) => txt="int rn = 0; \nint r = 0; \nend" (int n = 0 is missing)

//.*(?=[\r\n|\r|\n]) => txt="int rn = 0; \nint r = 0; \nend" (int n = 0 is missing) ...

It seems that something goes wrong with \r: it is not recognized as a line ending. If I only work with \r\n, the regex "//.*(?=\r\n)" works fine for the test content below:

        var text = "int rn = 0; //comment1.0\r\n" +
                   "int r = 0; //comment2.\r\n" + 
                   "int n = 0; //comment3.\r\n" + 
                   "end";

Can anyone help me out? Thanks in advance.

cheny
  • Please include a tag for the language. – Barmar Oct 21 '21 at 15:36
  • @Barmar Sorry, I thought it was a pure regex problem. But as Wiktor mentions below, it might be a .NET problem. I will add the language tag if so. :) – cheny Oct 22 '21 at 04:16
  • Is there any possibility the code you're processing might contain e.g. string literals which contain the `//` sequence and which *shouldn't* be treated as comments? – Damien_The_Unbeliever Oct 22 '21 at 09:05
  • @Damien_The_Unbeliever No, not in this test code. But it did happen in my old version (which did not use regex). There might be code like text = "//hello" and //"hello". I'm quite new to regex and will deal with these more complex cases later :) – cheny Oct 23 '21 at 07:50

1 Answer

In .NET, the . pattern matches any char except an LF (\n) char, so it also matches carriage return (CR, \r) chars.

Note that there is no option or modifier that makes . stop matching CR.

Thus, you can use

var RegexRemoveLineEndComment =  new Regex(@"//[^\r\n]*", RegexOptions.Multiline);

See the C# demo.
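
To make the difference concrete, here is a minimal sketch that runs the original pattern and the negated character class against the test string from the question. The class and method names (`DotVersusNegatedClass`, `RemoveWithDot`, `RemoveWithNegatedClass`) are made up for this illustration. The results are compared with expected strings instead of being printed raw, so the console does not act on the bare CR.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotVersusNegatedClass
{
    // Original pattern: "." also consumes \r, and "$" (Multiline) only stops before \n.
    public static string RemoveWithDot(string text) =>
        Regex.Replace(text, @"//.*$", string.Empty, RegexOptions.Multiline);

    // Negated class: stops at the first \r or \n, so every line ending is kept.
    public static string RemoveWithNegatedClass(string text) =>
        Regex.Replace(text, @"//[^\r\n]*", string.Empty);

    public static void Main()
    {
        var text = "int rn = 0; //comment1.0\r\n" +
                   "int r = 0; //comment2.\r" +
                   "int n = 0; //comment3.\n" +
                   "end";

        // The dot version swallows the \r characters and the whole "int n" line.
        Console.WriteLine(RemoveWithDot(text) == "int rn = 0; \nint r = 0; \nend"); // True

        // The negated class leaves \r\n, \r and \n in place.
        Console.WriteLine(RemoveWithNegatedClass(text) ==
                          "int rn = 0; \r\nint r = 0; \rint n = 0; \nend"); // True
    }
}
```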

If you also want to remove the whitespace before //, add \s* (any whitespace) or [\p{Zs}\t]* (horizontal whitespace) at the start of the pattern.
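
As a rough sketch of the horizontal-whitespace variant (the names `TrailingWhitespaceDemo` and `Strip` are invented for this example), the spaces and tabs in front of each comment are removed while the line endings stay untouched:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingWhitespaceDemo
{
    // [\p{Zs}\t]* consumes spaces and tabs before "//" but never a line break.
    private static readonly Regex CommentWithLeadingBlanks =
        new Regex(@"[\p{Zs}\t]*//[^\r\n]*");

    public static string Strip(string text) =>
        CommentWithLeadingBlanks.Replace(text, string.Empty);

    public static void Main()
    {
        var result = Strip("int r = 0; \t//comment2.\rint n = 0;  //comment3.\nend");
        Console.WriteLine(result == "int r = 0;\rint n = 0;\nend"); // True
    }
}
```

With \s* in place of [\p{Zs}\t]*, a comment that sits on a line of its own would also take the preceding line break with it, since \s matches CR and LF as well.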

Wiktor Stribiżew
  • Thanks for your help! Strangely, though, when I copied your code from the link into a new project, I got "int rn = 0; \nint r = 0; \nend" (int n = 0 is missing) again. The \r is still not recognized, which is really weird. Could a version difference matter? (I'm using .NET 5.) I'd like to try something with \r only, to see if there is any other clue. Thanks again. – cheny Oct 22 '21 at 04:13
  • @cheny Since I do not have your code, I cannot help more. I have just noticed that `//[^\r\n]*` regex does not need the `RegexOptions.Multiline` option, it can be removed. – Wiktor Stribiżew Oct 22 '21 at 07:01
  • Ahh! I finally know what really happened. I did get txt="int rn = 0; \r\nint r = 0; \rint n = 0; \nend", but printing it with Console.WriteLine(txt) was misleading. "\r\n" means "return to the start of the CURRENT line, then go to the next line and print the following characters there", while "\r" alone means "return to the start of the CURRENT line and print the following characters over it". So "int r = 0; " was overwritten by "int n = 0; "! I added a temporary variable to the code and discovered that. It happens in both the console and a text area component. – cheny Oct 22 '21 at 08:51
  • Thanks a lot again. Your point that . matches \r in .NET is the key. – cheny Oct 22 '21 at 08:55
  • I did :) @Wiktor – cheny Oct 22 '21 at 08:58
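
Following up on the display issue described in the comments: one way to avoid being misled by the console is to escape CR and LF before printing. This is only a sketch, and the `ControlCharDebug.Visible` helper is a made-up name, not part of .NET.

```csharp
using System;

public static class ControlCharDebug
{
    // Hypothetical helper: replaces CR and LF with visible escape sequences.
    public static string Visible(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    public static void Main()
    {
        var txt = "int rn = 0; \r\nint r = 0; \rint n = 0; \nend";

        // Printed on a single line, so nothing gets overwritten by a bare \r.
        Console.WriteLine(Visible(txt));
        // Prints: int rn = 0; \r\nint r = 0; \rint n = 0; \nend
    }
}
```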