1

I have read a lot of topics here about matching and capturing the string between curly braces in text, but didn't find an answer for matching and capturing the content of functions (especially when there is some logic inside). So I hope this topic won't be a duplicate.

I need to match several things in code files (I have a lot of them, and all of them have a similar structure but different depth), like the one below.

Here are things I need to capture:

  1. Main class name

  2. Sub classes names

  3. Sub classes functions names

  4. Content of each function

I need the first three to scan all our projects and map where those files (and their functions) are in use.

The last one is needed to match it against the list of specific services (internal and external) that can be used in those functions.

Code sample:

namespace Myprogramm.BusinessLogic
{
    public static class Utils
    {
        public static class Services
        {
            public static int GetSomeIDBySomeName()
            {
                // call some webservice
            }

            public static void UpdateViews()
            {
                // send some request
            }

            public static void IncreaseViews(int views)
            {
                if (views < 1000)
                {
                    // execute SQL SP1
                }
                else
                {
                    // execute SQL SP2
                }
            }
        }

        public static class SomeApi
        {
            public int OpenSomeSession(int someId)
            {
                if (someId < 0)
                {
                    // do something...
                }
                else
                {
                    // do something else ...
                }
            }
        }
    }
}

What I'm attempting to do is read those files as text and match their content against some regular expressions to capture the things I need.

I'm new to regular expressions, so I haven't had much success here. I can't figure out how to match and capture the content of the sub classes, and then how to do the same thing for the functions.

I tried to work with this one (in another task) to capture the content of simple functions (with no logic inside):

/{([^}]*)}/

And with this (also in another task to get content of the main class/namespace):

/{([\s\S]*)}/

And I do understand why these don't help me in this task.

To be clear, first of all I need to capture this one (to get the main class name) and its content:

public static class Utils {...}

*** this one I actually understand

Then those two (to capture sub classes names and their content):

1.

public static class Services {...}

2.

public static class SomeApi {...}

And then (just for the first sub class as an example):

1.

public static int GetSomeIDBySomeName() {...}

2.

public static void UpdateViews() {...}

3.

public static void IncreaseViews(int views) { if (views < 1000) {...} else {...} }
Jonny 5
neoselcev
  • You need different capture groups within the namespace block? Perhaps you should apply different regexes to the resulting capture groups. – Veverke Aug 30 '15 at 14:52
  • I know that. I can use `/{([\s\S]*)}/` to get the namespace and its contents first, and then use it again on the result to get the main class and its contents. But I can't figure out what the regular expression should be to capture the two sub classes and their contents from inside the main class. – neoselcev Aug 30 '15 at 15:08
  • See *Matching Nested Constructs* in [Friedl's book](https://books.google.com/books?id=P5UXAwAAQBAJ&lpg=PA328&ots=HAjQ68fgTv&hl=en&dq=mastering%20regular%20expressions%20nested%20constructs&pg=PA436#v=onepage&q=Matching%20Nested%20Constructs&f=false): `{(?>[^{}]+|{(?<x>)|}(?<-x>))*(?(x)(?!))}` Something [like this](http://regexhero.net/tester/?id=2d30cc1f-709c-492b-be22-660ca1c757fc). – Jonny 5 Aug 30 '15 at 15:28
  • @neoselcev I think it's difficult to do this with a standalone regex. I have an idea for doing it with regex combined with detecting the indentation of your code. Are you interested? – fronthem Aug 30 '15 at 16:00
  • @terces907 I'm using regex because I couldn't think of anything else to accomplish the task. Can you please explain your idea? I'll be happy to hear and understand other ideas. – neoselcev Aug 30 '15 at 16:06
  • @Jonny5 First of all, WOW! It works! It helps me capture segments like the subclasses and functions (though it doesn't capture their contents, do you know why?). I'll definitely read this book, but it'll take me a while. Can you please explain your answer? – neoselcev Aug 30 '15 at 16:07
  • You just detect the level of indentation in your code, e.g. a class starts with `^\s{4}.*class.*$` and ends with `^\s{4}\}$`. That way you can simply detect the scope of the class. One limitation is that the code must follow formatting best practice, because we rely on indentation. – fronthem Aug 30 '15 at 16:09
  • @neoselcev Did you try with a capture group? `{((?>[^{}]+|{(?<x>)|}(?<-x>))*(?(x)(?!)))}` and grab `$1` – Jonny 5 Aug 30 '15 at 16:14
  • @terces907 Thank you for the idea. But there are two problems I can think of right now: first, as I understand it, relying on counting indentation is not good practice as long as I'm using regex (I chose regex because I can find "templates" relying only on the main syntax of the code); second, the code is not mine, and I can't be sure that it is well formatted. – neoselcev Aug 30 '15 at 16:18
  • @Jonny5 Yes, I did, and the $1 group always returns empty. I can work around it by using `/{([\s\S]*)}/` on the result to get the contents, but it will be a patch... – neoselcev Aug 30 '15 at 16:24
  • @neoselcev Why don't you count brackets manually in a loop? To know the exact structure of the code you would have to write a program with something like `Lex` or `Yacc`. If you want to keep it easy, I can give you some ideas for detecting keywords and counting pairs of brackets until the end of the class, subclass, etc. – fronthem Aug 30 '15 at 16:25
  • @neoselcev [Here's a tutorial](http://www.regular-expressions.info/balancing.html) on matching nested constructs and [it's also explained here](http://weblogs.asp.net/whaggard/377025) for a scenario similar to yours. I have no .NET/C# experience and could not explain it better. Somebody experienced will answer :] Also [tested it here](http://goo.gl/l01rrc). – Jonny 5 Aug 30 '15 at 16:25
  • @Jonny5 Thank you! Can you please move your comment to a separate answer, so I can vote when I get it to work? – neoselcev Aug 30 '15 at 16:30
  • If the sample code is correct and compiles, why not use `CSharpCodeProvider`? With it you can easily get the first three things. Except the fourth... – Alexander Petrov Aug 30 '15 at 16:33
  • @terces907 I thought about scanning the code with countless loops, but it seemed much more complicated to write, support, and explain to others... Regex seems much more "readable" and flexible (maybe I'm mistaken). And implementing something like a Lex program will not be allowed in my company for security reasons (can't explain it). But thank you! Maybe, if there is a more complicated task, I'll try this way of thinking. – neoselcev Aug 30 '15 at 16:37
  • @AlexanderPetrov Nice idea, I'll look into it. – neoselcev Aug 30 '15 at 16:39
  • @neoselcev Please check if my edit of your question-title is ok! – Jonny 5 Aug 30 '15 at 17:25
  • @Jonny5 Great! Thanks! – neoselcev Aug 30 '15 at 17:34

2 Answers

1

In Jeffrey Friedl's book Mastering Regular Expressions there's a suitable sample on page 436.
How to match nested constructs is also explained at regular-expressions.info and weblogs.asp.net.

Adapting the example from those sources to braces gives something like this:

{(?>[^{}]+|{(?<x>)|}(?<-x>))*(?(x)(?!))}

Here x tracks the nesting depth. Test it at regexhero.net

  • (?> opens an atomic group
  • [^{}]+ matches one or more characters that are not braces
  • {(?<x>) matches an opening brace and adds to the depth (pushes onto the stack x)
  • }(?<-x>) matches a closing brace and subtracts from the depth (pops from the stack x)
  • (?(x)(?!)) fails the match unless the depth is zero before the final }
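
To make the pieces above concrete, here is a minimal C# sketch that wraps the balanced-brace construct in a named group and puts a simple header pattern in front of it, so that both the name and the contents of each block are captured. The class `BraceBlocks`, its methods, and the two header patterns are hypothetical names chosen for illustration, and they assume code shaped like the sample in the question. Braces inside string literals or comments would throw the depth count off, so treat this as a starting point rather than a parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BraceBlocks
{
    // The balanced-brace construct from the answer, with the braces escaped
    // for readability and the inner text captured in a named group "body".
    private const string BalancedBody =
        @"\{(?<body>(?>[^{}]+|\{(?<x>)|\}(?<-x>))*(?(x)(?!)))\}";

    // Hypothetical header patterns: "class Name" or "Name(args)" directly
    // followed by a balanced block. Real code may need stricter headers.
    private static readonly Regex ClassBlock =
        new Regex(@"\bclass\s+(?<name>\w+)\s*" + BalancedBody);
    private static readonly Regex MethodBlock =
        new Regex(@"\b(?<name>\w+)\s*\([^()]*\)\s*" + BalancedBody);

    private static List<(string Name, string Body)> Find(Regex pattern, string text)
    {
        var found = new List<(string Name, string Body)>();
        foreach (Match m in pattern.Matches(text))
            found.Add((m.Groups["name"].Value, m.Groups["body"].Value.Trim()));
        return found;
    }

    // Outermost classes in the text; nested classes stay inside Body.
    public static List<(string Name, string Body)> Classes(string text) =>
        Find(ClassBlock, text);

    // Outermost "name(args) { ... }" blocks in the text.
    public static List<(string Name, string Body)> Methods(string text) =>
        Find(MethodBlock, text);

    public static void Main()
    {
        const string source = @"
public static class Utils
{
    public static class Services
    {
        public static void UpdateViews()
        {
            // send some request
        }

        public static void IncreaseViews(int views)
        {
            if (views < 1000) { /* SP1 */ } else { /* SP2 */ }
        }
    }
}";
        // Apply the patterns level by level: class, sub classes, methods.
        foreach (var outer in Classes(source))
        {
            Console.WriteLine("class " + outer.Name);
            foreach (var inner in Classes(outer.Body))
            {
                Console.WriteLine("  class " + inner.Name);
                foreach (var method in Methods(inner.Body))
                    Console.WriteLine("    method " + method.Name);
            }
        }
    }
}
```

Because each match consumes its whole balanced block, an `if (...) { ... }` inside a method body is not reported as a separate method. Applying the same patterns again to a captured `Body` is what walks one level deeper.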

Reference - What does this regex mean

Jonny 5
  • Thank you, it saved my day! The only question left is: how do I turn it to capture the inside of the curly braces (without workarounds)? Adding `()` to its edges: `{((?>[^{}]+|{(?<x>)|}(?<-x>))*(?(x)(?!)))}` to capture group $1 doesn't help. – neoselcev Aug 30 '15 at 16:56
  • @neoselcev Welcome! [tried at regexstorm.net](http://goo.gl/l01rrc) - when you click on "table" it shows the desired capture for `$1`. Also could try with lookarounds: `(?<={)(?>[^{}]+|{(?<x>)|}(?<-x>))*(?(x)(?!))(?=})`. If it doesn't work, why not just strip the outer `{` `}` and trim. – Jonny 5 Aug 30 '15 at 17:12
  • My code is a bit more complicated, as I have more groups to match in a single line. I'll try it too. Probably the problem is in my code. Thanks! – neoselcev Aug 30 '15 at 17:35
  • After a bit of refactoring, adding `()` at the edges worked! Thanks! – neoselcev Sep 08 '15 at 18:43
0

In general, languages with nested constructs are in a different category (context-free languages) than the languages defined by regular expressions (regular languages). Regular languages have grammars that don't allow arbitrary nesting, and they are parsed efficiently with a deterministic or nondeterministic finite state automaton. Context-free languages need at least a stack-based (pushdown) automaton, which can store the current nesting level somewhere (in this case, on the stack). To parse nested parenthesized expressions with a regexp, you first have to restrict the language so that it looks almost like the context-free one but is actually regular: just put an upper bound on the nesting level you allow, and you have a regular language. Only with such a bound can the nested language be handled by a true regular expression.

With the extensions that some languages (like Perl or Python) make to regexps, there is a way to cope partially (but not generally) with this.

In your case, you have up to five levels of nesting (counting not only curly brackets, but plain parentheses also). The automaton (and the regular expression that allows five levels to be parsed) will be complex anyway.
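
The bounded-depth idea can be sketched in C# by building the pattern level by level, using only plain regular-expression operators (no balancing groups or recursion). The class `BoundedDepth` and its methods are hypothetical names for illustration, and only curly braces are counted here, which is a simplification of the five levels mentioned above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundedDepth
{
    // Builds a pattern for one {...} block whose contents may contain
    // blocks nested at most maxDepth further levels deep.
    public static string BlockPattern(int maxDepth)
    {
        string inner = @"[^{}]*";                  // no braces inside at all
        for (int i = 0; i < maxDepth; i++)
            inner = @"(?:[^{}]|\{" + inner + @"\})*";
        return @"\{" + inner + @"\}";
    }

    // True if the whole text is a single block within the depth bound.
    public static bool IsBlock(string text, int maxDepth) =>
        Regex.IsMatch(text, @"\A" + BlockPattern(maxDepth) + @"\z");

    public static void Main()
    {
        // The pattern is wrapped once more for every extra level allowed.
        Console.WriteLine(BlockPattern(1));
        Console.WriteLine(IsBlock("{a{b}c}", 1));   // one nested level
        Console.WriteLine(IsBlock("{a{b{c}}}", 1)); // two nested levels
    }
}
```

With a bound of 1, the second input nests too deeply and is rejected, while a bound of 2 accepts it. The pattern stays a true regular expression at every bound, but it has to be rebuilt whenever deeper code shows up.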

Luis Colorado