10

I have a giant C project with numerous C files. I have to find all inner-loops. I am sure there is no any O(n³) block in the project, so only O(n²)-compexity blocks must be found (a loop in a loop).

Is it possible to find all inner loops using grep? If yes, what regexp may I use to find all occurrences of inner loops of all kinds like {for,for}, {while,for}, {for, while}, {do, while}, etc. ? If no, is there any simple unix-way method to do it (maybe multiple greps or a kind of awk)?

psihodelia
  • 29,566
  • 35
  • 108
  • 157
  • 11
    `I am sure there is no O(n^3) block` ... *really?* Do you know the complexity of every function you are calling (especially within a loop)? – fredley Nov 25 '11 at 14:20
  • 1
    Sounds like a lost cause to me, sorry for saying that, but @TomMedley 's note is really reasonable. – Kiril Kirov Nov 25 '11 at 14:29
  • 4
    Identifying "loops withing loops" is clearly not enough. You need to analyze everything (`for (;i<= n*n*n*n;)`? `while(1)`? `goto`?) – Mat Nov 25 '11 at 14:31
  • @Mat: + recursive calls, nesting loops in loops in loops in loops... – Alexey Frunze Nov 25 '11 at 14:34
  • 4
    Why don't you use a profiler? – ckruse Nov 25 '11 at 14:38
  • 2
    @Alex: let's not forget `setjmp`/`longjmp` ... – Mat Nov 25 '11 at 14:39
  • Recursive calls are forbidden in this project (embedded, multi-platform). As well as while(1), goto, and f(n) as loop conditions. – psihodelia Nov 25 '11 at 14:42
  • 3
    Forbidden doesn't mean absent. – mouviciel Nov 25 '11 at 15:15
  • How giant is your project? More than a million lines of C code?? – Basile Starynkevitch Nov 25 '11 at 15:29
  • 1
    @psihodelia Even assuming all those are absent, say you have a loop like this: `for(x) { for(y) { call(x,y); } }`, and inside `call()` you have a loop or two, maybe even nested. Any inner loops are automatically nested (when called from there, not necessarily every time). – Kevin Nov 25 '11 at 15:58
  • coccinelle (http://coccinelle.lip6.fr/) may be better suited for this than grep. – ninjalj Nov 25 '11 at 19:07
  • 1
    Embedded system --> no recursion. As a first order principle, you'd expect this. Anywhere there is a recursive data structure (tree? red-black list? quicksort?) you'll get a recursive algorithm even if it is an embedded system. So, big embedded systems do contain recursion. Just not lots of it. – Ira Baxter Nov 25 '11 at 21:01

5 Answers5

12

Regex are for regular languages, what you are describing seems like a Context-Free, and i'm pretty sure it can't be done using Regular Expressions. See the answer to a similar question here . You should look for other type of automata like a scripting language(python or so).

Community
  • 1
  • 1
6D65
  • 775
  • 6
  • 14
  • Why? Let's say to find all those blocks which start with for and include another for in their {} bodies. The same rule for {while,for}, {for,while}, etc. Is it possible with regexps? Sure. – psihodelia Nov 25 '11 at 14:45
  • 3
    @psihodelia You can't use a regex to keep track of things like matching braces, so while you can look for a `for` after another `for` you can't say if it's inside the first `for` or not. – Some programmer dude Nov 25 '11 at 15:00
  • I see what you want. But you'll find only the word "while" followed by the braces and another "while", it would be impossible to select the inner block of text which represents the thing you are looking for. Regular Expressions are pattern matching machines, you'll need a higher order automata to put some logic in it. – 6D65 Nov 25 '11 at 15:07
  • @6D65 I need just a line number where first for in a for*{*for pattern was found. – psihodelia Nov 25 '11 at 15:30
  • 2
    @psihodelia You can try with something like `grep 'for([^,]*,[^,]*,[^)]*){[^}]*for'`, but that will fail on a great many cases, and you can't do much better because regexes can't match parens or braces, so if you have an `if{...}` within the nest, you can't distinguish that `}` from the ending brace for your `for(...){...}`. – Kevin Nov 25 '11 at 16:05
  • using extended Regexes (Perl) you can at least detect some basic context free languages like start -> A(start)B | epsilon. Of course, that does not count as regex in the formal CS. See [this SO post](http://stackoverflow.com/questions/7983115/are-perl-regexes-turing-complete). – mbx Nov 25 '11 at 21:00
5

This is a good case for specific compiler extensions. The recent GCC compiler (that is version 4.6 of GCC) can be extended by plugins (painfully coded in C) or by MELT extensions; MELT is a high-level domain specific language to code GCC extensions in, and MELT is much easy to use than C.

However, I do admit that coding GCC extensions is not entirely trivial: you have to partly understand how GCC works, and what are its main internal representations (Gimple, Tree, ...). When extending GCC, you basically add your own compiler passes, which can do whatever you want (-including detecting nested loops-). Coding a GCC extension is usually more than a week of work. (The hardest part is to understand how GCC works).

The big advantage of working within the GCC framework (thru plugins in C or extensions in MELT) is that your extensions are working on the same data as the compiler does.

Back to the question of finding nested loops, don't consider it as only purely syntactic (this is why grep cannot work). Within the GCC compiler, at some level of internal representations, a loop implemented by for, or while, or do, or even with goto-s, is still considered a loop, and for GCC these things can be nested (and GCC knows about the nesting!).

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
3

Without a C parser, you can get a heuristic solution at best.

If you can rely on certain rules being consistently followed in the code (e.g. no goto, no loops through recursion, ...), you can scan the preprocessed C code with regexps. Certainly, grep is not sophisticated enough but with a few lines of Perl or similar it is possible.

But the technically better and much more reliable approach is to use a real C parser.

undur_gongor
  • 15,657
  • 5
  • 63
  • 75
1

There are three kinds of loops in C:

  • "structured syntax" (while, for, ...) [Watch out for GCC, which can hide statements therefore loops inside expressions using (stmt; exp) syntax!]
  • ad hoc loops using goto; these interact with the structured syntax.
  • recursion

To find the first kind, you have to find the structured syntax and the nesting.

Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then simply inspect those that occurred in the same file to see if they were nested. (If you wanted to do this without the false positives price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused as to when a string of characters is a keyword, a number, a string, etc.)

If you want to find those loops automatically, you pretty much need a full C parser with expanded preprocessing accomplished. (Otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes, and counting nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract sytnax trees. ANTLR can likely do this; I think there's a C grammar obtainable for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.

You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in so it can read the code directly and expand as it parses; it also handles a relatively wide variety of dialects of C and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language (e.g., C) front end, you can write patterns for the that language directly using the language syntax. So we can express fragments of what we want to find easily:

 pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
     " for(\t,\l\,\i) { \s } ";

 pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
     " for(\t,\l\,\i)  \s ";

 pattern block_while_loop(c:expression, s: statements): statement
     " while(\c) { \s } ";

 pattern statement_while_loop(c:expression): statement
     " for(\c)  \s ";

 ...

and group them together:

 pattern_set syntactic_loops
     { block_for_loop,
       statement_for_loop,
       block_while_loop,
       block_statement_loop,
       ...
     }

Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There's a lot of node types in an AST for a real C parser!) Finding nested loops this way would be pretty straightforward: scan the tree top down for a loop (using the pattern set); any hits must be top level loops. Scan subtrees of a found loop AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops with arbitrary nesting. This works for the GCC loops-with-statements stuff, too. The tree nodes are stamped with precise file/line/column information so its easy to generate a report on location.

For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph, and then structure that graph into nested control flow regions. The point here is that a while loop that contains an unconditional goto out isn't really a loop in spite of the syntax; and an if statement whose then clause gotos back to code upstream of the if is likely really a loop. All that loop syntax stuff is really just hueristic hints you may have loop! Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs, and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of region to get location information to report.

GCC is always a lot fun due to its signficantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!). To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.

Recursion is harder to find. For this, you have to determine if A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph, containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too, voila, I'm pleased to say it does. Constructing that call graph of course requires points-to anlaysis for function calls; yes, DMS does that too. With the call graph, you can find cycles in the call graph, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., loops in one function, that call a function in another compilation unit that also contains loops.

To find structures such a loops accurately you need a lot of machinery (and this will take some effort, but then C is a bitch of a language to analyze) and DMS can provide it.

If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
0

I suspect that finding something like this would be impossible using grep alone:

public void do(){
    for(...){
        somethingElse();
    }
}

... Insert other code...

public void somethingElse(){
    for(.....){
        print(....);
    }
}
sixtyfootersdude
  • 25,859
  • 43
  • 145
  • 213