The enum I want to extract looks like this:

...
other code 
...
enum A
{
  a,
  b=2,
  c=3,
  d//{x}
}
...
More Enums like the above.
...

First, I tried this regex with the Singleline option:
enum\s*\w+\s*{.*?\}

However, because the comments contain brackets, this regex doesn't work: it stops at the first closing bracket it finds inside a comment.
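
A minimal reproduction of this failure, assuming the sample enum above is held in a string (the class name is made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class LazyMatchDemo
{
    static void Main()
    {
        // Sample from the question; the comment on the last member contains brackets.
        string code = "enum A\n{\n  a,\n  b=2,\n  c=3,\n  d//{x}\n}\n";

        // The first attempt: lazy match up to the nearest '}' with Singleline on.
        Match m = Regex.Match(code, @"enum\s*\w+\s*{.*?\}", RegexOptions.Singleline);

        // The match ends at the '}' inside the comment, so the enum's own
        // closing bracket is not part of the match.
        Console.WriteLine(m.Value);
        Console.WriteLine(m.Value.EndsWith("//{x}", StringComparison.Ordinal));
    }
}
```

The second line printed shows whether the match was cut off at the comment rather than at the end of the enum.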

So I tried to exclude brackets that follow a comment marker. From what I have found so far, it seems I need a negative lookahead together with the inline Multiline grouping construct.

Next, I tried to match only brackets that are not preceded by a comment.
As a first step, I tried to find the brackets that do come after a comment: (?m:^.*?//.*?}.*?$).

However, it seems that . still matches any character, including newlines, even in inline Multiline mode.
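
One possible explanation, offered as an assumption rather than a diagnosis: in .NET the Multiline option only changes where ^ and $ match, and . crosses newlines only when Singleline is on. If Singleline was still enabled for the whole pattern, (?m:...) would not switch it off, whereas (?m-s:...) would. A small sketch on a shortened, made-up sample:

```csharp
using System;
using System.Text.RegularExpressions;

class InlineOptionsDemo
{
    static void Main()
    {
        // Shortened sample (an assumption for illustration).
        string code = "a,\nd//{x}\n}";

        // Singleline stays on inside (?m:...), so '.' can still cross the newline.
        Match stillSingleline = Regex.Match(code, @"(?m:^.*?//.*?}.*?$)", RegexOptions.Singleline);

        // (?m-s:...) turns Multiline on and Singleline off inside the group.
        Match lineOnly = Regex.Match(code, @"(?m-s:^.*?//.*?}.*?$)", RegexOptions.Singleline);

        Console.WriteLine(stillSingleline.Value.Contains("\n")); // True
        Console.WriteLine(lineOnly.Value == "d//{x}");           // True
    }
}
```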

Then I tried using Multiline from the start. Since the main problem is the brackets in comments, I tried:
(?!//.*)}
However, the negative lookahead doesn't work the way I expected.

Here is a csharp-regex-test-link for you to test.

To summarize, I need to parse enums from a C# source code file.

The main problem for me is the brackets in comments.

Edit: To clarify

1. Brackets in comments always come in pairs. For example:

xxx=xxx; //{xx}

2. Comments only appear in the // form.

3. I can't rely on indentation.

Uwe Keim
AlexWei
  • Not sure if `.NET` supports recursion, but if so, you could use https://regex101.com/r/AAuHg2/1/ If not, you could use balanced group constructs - https://learn.microsoft.com/en-us/dotnet/standard/base-types/grouping-constructs-in-regular-expressions#balancing_group_definition – Jan Jan 23 '19 at 10:55
  • If your code is well-indented (with starting and ending `{` on their own lines), you may leverage that: `(?ms)enum\s*\w+\s*^{.*?^}\r?$`. You can't rely on balanced groups because `{` and `}` in the comments do not have to be balanced. Recursion would not have helped had it been there in .NET regex. – Wiktor Stribiżew Jan 23 '19 at 10:55
  • @Jan That won't work because comments may contain `// text } here` – Wiktor Stribiżew Jan 23 '19 at 10:56
  • @WiktorStribiżew: Why should recursion not help here? See https://regex101.com/r/AAuHg2/1/ – Jan Jan 23 '19 at 10:56
  • @Jan https://regex101.com/r/AAuHg2/2 – Wiktor Stribiżew Jan 23 '19 at 10:57
  • @WiktorStribiżew I can't rely on indentation. But the brackets in comments are in pairs. Thanks. I will try to translate yours into .NET regex to see if it helps. – AlexWei Jan 23 '19 at 11:03
  • You may try `@"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}"`. See [this demo](http://regexstorm.net/tester?p=%5cbenum%5cs*%5cw%2b%5cs*%7b%28%3f%3e%5b%5e%7b%7d%5d%2b%7c%28%3f%3co%3e%29%7b%7c%28%3f%3c-o%3e%29%7d%29*%28%3f%28o%29%28%3f!%29%29%7d&i=enum+A%0d%0a%7b%0d%0a++a%2c%0d%0a++b%3d2%2c%0d%0a++c%3d3%2c%0d%0a++d%2f%2f%7bx%7d%0d%0a%7d%0d%0a...%0d%0aMore+Enums+like+the+above.%0d%0a...). – Wiktor Stribiżew Jan 23 '19 at 11:08
  • Isn't the attempt to parse source code with Regex the same wrong approach as [parsing HTML with Regex](https://stackoverflow.com/a/1732454/107625)? – Uwe Keim Jan 23 '19 at 11:09
  • @WiktorStribiżew It works. Thanks. – AlexWei Jan 23 '19 at 11:11
  • @UweKeim Probably right. However, it depends. I have used Roslyn to parse C# code, but there are some constraints in production. Doing it the right way may not be the right solution here. But thanks for your suggestion. – AlexWei Jan 23 '19 at 11:15

2 Answers

3

You may use

@"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}"

See the regex demo

Details

  • \benum - a whole word enum
  • \s* - 0+ whitespaces
  • \w+ - 1+ word chars
  • \s* - 0+ whitespaces
  • { - a { char
  • (?>[^{}]+|(?<o>){|(?<-o>)})* - either 1+ chars other than { and }, or a { with an empty string pushed onto the Group o stack, or } with a value popped from Group o stack
  • (?(o)(?!)|) - a conditional yes-no construct that fails the match and makes the regex engine backtrack at the current location if Group o still has any items left on the stack
  • } - a } char.
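
As a rough usage sketch, the pattern above can be run over a small piece of source text like this. The sample input and the class name are assumptions made for illustration; the pattern itself is unchanged:

```csharp
using System;
using System.Text.RegularExpressions;

class BalancedEnumDemo
{
    // Pattern from the answer above, unchanged.
    static readonly Regex EnumRegex = new Regex(
        @"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*(?(o)(?!)|)}");

    static void Main()
    {
        // Made-up input: one class, the enum from the question, and a one-line enum.
        string code =
            "class C { }\n" +
            "enum A\n{\n  a,\n  b=2,\n  c=3,\n  d//{x}\n}\n" +
            "enum B { e, f } // {y}\n";

        // Each match runs from the 'enum' keyword to its balancing '}'.
        foreach (Match m in EnumRegex.Matches(code))
        {
            Console.WriteLine(m.Value);
            Console.WriteLine("----");
        }
    }
}
```

This relies on the brackets inside comments being paired, as stated in the question. A comment holding a lone `}` or `{` would throw the balance off, so the pattern is not a general C# parser.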
Wiktor Stribiżew
  • About the conditional yes-no construct: why did you add `|`? What is the difference between `(?(o)(?!))` and `(?(o)(?!)|)`? – AlexWei Jan 24 '19 at 02:34
  • @AlexWei It is usually used without `|` in the balanced construct, but to keep it consistent with the regex grammar, the empty no part pattern is welcome. In other situations, missing no part might lead to unexpected results. – Wiktor Stribiżew Jan 24 '19 at 07:14
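
A quick way to compare the two spellings of the conditional is to run both over the same inputs. The sketch below uses two made-up inputs and a made-up class name. Agreement on these inputs illustrates the comment above, but does not prove the forms are interchangeable in every pattern:

```csharp
using System;
using System.Text.RegularExpressions;

class ConditionalFormsDemo
{
    // Shared prefix of the answer's pattern, up to the conditional.
    const string Prefix = @"\benum\s*\w+\s*{(?>[^{}]+|(?<o>){|(?<-o>)})*";

    static void Main()
    {
        var withBar = new Regex(Prefix + @"(?(o)(?!)|)}");   // explicit empty "no" part
        var withoutBar = new Regex(Prefix + @"(?(o)(?!))}"); // "no" part omitted

        string balanced = "enum A\n{\n  a,\n  d//{x}\n}\n";
        string unbalanced = "enum A { { }"; // one '{' is never closed

        // Both comparisons are expected to print True.
        Console.WriteLine(withBar.Match(balanced).Value == withoutBar.Match(balanced).Value);
        Console.WriteLine(withBar.IsMatch(unbalanced) == withoutBar.IsMatch(unbalanced));
    }
}
```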
1

I don't think it is possible to do your task with a single regex. What if you have a string that looks like

var notEnum = "enum A {a, b, c}";

However, you can capture your enums in a few passes. Take a look at this algorithm:

  1. Clear the contents of string literals
  2. Drop multiline comments
  3. Drop single-line comments
  4. Use your original regex

Example:

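using System;
using System.Linq;
using System.Text.RegularExpressions;

var code = ...

// One pattern per pass: string literals, /* */ comments, // comments, then the enum itself.
var stringLiterals = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"", RegexOptions.Compiled);
var multilineComments = new Regex("/\\*.*?\\*/", RegexOptions.Compiled | RegexOptions.Singleline);
var singlelineComments = new Regex("//.*$", RegexOptions.Compiled | RegexOptions.Multiline);
var @enum = new Regex("enum\\s*\\w+\\s*{.*?}", RegexOptions.Compiled | RegexOptions.Singleline);

// Pass 1: empty every string literal so text inside quotes cannot look like code or comments.
code = stringLiterals.Replace(code, m => "\"\"");
// Passes 2 and 3: remove comments, and with them any brackets they contain.
code = multilineComments.Replace(code, m => "");
code = singlelineComments.Replace(code, m => "");

// Pass 4: with comments gone, the lazy match stops at the enum's real closing bracket.
var enums = @enum.Matches(code).Cast<Match>().ToArray();

foreach (var match in enums)
    Console.WriteLine(match.Value);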
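
To try the passes on a concrete input, here is a self-contained sketch that combines the string-literal case from this answer with the enum from the question. The `EnumExtractor` class and `ExtractEnums` helper are hypothetical names, and the patterns are the same four as above, written as verbatim strings:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EnumExtractor
{
    static readonly Regex StringLiterals = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");
    static readonly Regex MultilineComments = new Regex(@"/\*.*?\*/", RegexOptions.Singleline);
    static readonly Regex SinglelineComments = new Regex(@"//.*$", RegexOptions.Multiline);
    static readonly Regex EnumBlock = new Regex(@"enum\s*\w+\s*{.*?}", RegexOptions.Singleline);

    // Hypothetical helper: runs the passes in order and returns the matched enum texts.
    public static string[] ExtractEnums(string code)
    {
        code = StringLiterals.Replace(code, "\"\"");     // empty string literals
        code = MultilineComments.Replace(code, "");      // drop /* ... */ comments
        code = SinglelineComments.Replace(code, "");     // drop // comments
        return EnumBlock.Matches(code).Cast<Match>().Select(m => m.Value).ToArray();
    }

    static void Main()
    {
        // Made-up input: a string that looks like an enum, then a real enum.
        string code =
            "var notEnum = \"enum A {a, b, c}\";\n" +
            "enum A\n{\n  a,\n  b=2,\n  c=3,\n  d//{x}\n}\n";

        foreach (string e in ExtractEnums(code))
            Console.WriteLine(e);
    }
}
```

Note that the returned text comes from the cleaned copy of the source, so comments inside the enum body are gone from the result.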
Aleks Andreev