2

I'm looking for a way to remove the whole bodies of functions in a C source file.

For example, I have a file with this content:

1.  int func1 (int para) {
2.    return para;
3.  }
4.
5.  int func2 (int para) {
6.    if (1) {
7.      return para;
8.    }
9.    return para;
10. }

I have tried this regex:

content = re.sub('(\{[.*]?\})', '', content, flags=re.DOTALL)

But there is a problem with nested { }. This regex substitutes only up to the first }, so lines 9 and 10 are still in the content. I think the solution should be to count the { and } brackets and stop the substitution when the counter is back at 0: when { is found, counter++; when } is found, counter--. But I have no idea how to implement this in Python. Can you guys give me a kick?
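
A quick check with Python's standard `re` module may help pin down the symptom. Inside a character class, `.` and `*` are literal characters, so the pattern as written can only match `{}`, `{.}` or `{*}`. The behavior described above (stopping at the first `}`) is what a non-greedy `\{.*?\}` would do; that variant is an assumption about what was intended, not something stated in the question.

```python
import re

content = """int func2 (int para) {
  if (1) {
    return para;
  }
  return para;
}
"""

# Pattern from the question: [.*] is a character class (a literal '.' or '*'),
# so nothing in this sample matches and the text is left unchanged.
as_written = re.sub(r'(\{[.*]?\})', '', content, flags=re.DOTALL)

# Non-greedy variant (assumed intent): it stops at the first '}',
# which closes the inner `if`, not the function.
non_greedy = re.sub(r'\{.*?\}', '', content, flags=re.DOTALL)

print(as_written == content)
print(non_greedy)
```

With the non-greedy variant, the trailing `return para;` and the final `}` are left behind, which is the nesting problem in a nutshell.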

nich
  • You would probably be better served by picking a real parser [from some available options](http://wiki.python.org/moin/LanguageParsing) -- parsing C via regexp is doomed to frustration and annoyance. I expect you can make it work for simple toys but any real codebase will probably use _some_ construct that you have trouble duplicating with regexp. – sarnold Apr 25 '12 at 22:55
  • Out of curiosity why do you want to do that? I'd be very surprised if there wasn't a better way to solve your real problem. – Flexo Apr 25 '12 at 22:59
  • My real problem is to get the return type, name and parameter list of every function, so before matching these I would like to cut all useless information from the source file, for example macros, comments and bodies. But the nesting problem got me stuck. – nich Apr 25 '12 at 23:07
  • Check out this question, you have to use recursion. http://stackoverflow.com/questions/10318351/removing-replacing-multi-line-code-sections-with-python/10319390#10319390 – Ashwini Chaudhary Apr 25 '12 at 23:16
  • The theory says it is impossible to do what you want with regular expressions. Regular expressions handle regular languages (that's where their name comes from). A language with matching delimiter pairs is not regular. It is, at best, context-free (a larger class of languages). You need an actual parser (see @sarnold's excellent answer). – cvoinescu Apr 26 '12 at 00:45
  • Ashwini Chaudhary: Thanks, I used recursion with indexed brackets and a counter, as I was planning. It works nicely :-) – nich Apr 26 '12 at 01:47
  • @user1323007- Be careful if you're counting brackets. That method only works if all of your brackets are in unique matched pairs, which isn't necessarily the case (consider the case of multiple lines like `if (...) {` inside an `#if`..`#else` block that all share the same `}`). I recommend that you run the code through a pre-processor (like `unifdef`) first in order to simplify things. – bta Apr 26 '12 at 21:10
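
The caveat in the last comment can be illustrated with a minimal sketch. The C fragment below is hypothetical (the names `USE_A`, `f` and `g` are made up for illustration): both branches of the `#if` open a block, but only one survives preprocessing, so a single `}` closes it and a naive counter never returns to 0.

```python
def brace_depth(source):
    """Return the net brace depth after scanning `source` character by character."""
    depth = 0
    for ch in source:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
    return depth


# Hypothetical C fragment: valid after preprocessing, unbalanced before it.
sample = """void f(int a, int b) {
#if USE_A
  if (a) {
#else
  if (b) {
#endif
    g();
  }
}
"""

print(brace_depth(sample))  # prints 1: three '{' but only two '}'
```

Running the source through a preprocessor first, as suggested above, removes one of the two branches and restores the balance.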

4 Answers

9

I think you're trying to re-invent a wheel that has already been implemented many times before. If all you want is to extract the signature of each function in a C file, there are much easier ways to do it.

The ctags utility will take care of this for you:

~/test$ ctags -x --c-types=f ./test.c
func1            function      1 ./test.c         int func1 (int para) {
func2            function      5 ./test.c         int func2 (int para) {
~/test$ # Clean up the output a little bit
~/test$ ctags -x --c-types=f ./test.c | sed -e 's/\s\+/ /g' | cut -d ' ' -f 5-
int func1 (int para) {
int func2 (int para) {
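
If the result is needed inside a Python program, the cross-reference output can be split into fields there instead of with sed and cut. This is only a sketch: `parse_ctags_x` is a made-up helper name, the two input lines are copied from the output above, and it assumes file names contain no whitespace.

```python
# Two lines copied from the `ctags -x` output shown above.
ctags_output = """\
func1            function      1 ./test.c         int func1 (int para) {
func2            function      5 ./test.c         int func2 (int para) {
"""


def parse_ctags_x(text):
    """Split each line into (name, kind, line number, file, source text)."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 4)  # the fifth field keeps its inner spaces
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, kind, lineno, path, source = fields
        # drop the trailing opening brace and surrounding spaces
        entries.append((name, kind, int(lineno), path, source.rstrip(' {')))
    return entries


for name, kind, lineno, path, source in parse_ctags_x(ctags_output):
    if kind == 'function':
        print(f"{path}:{lineno}: {source};")
```

In a real script the text would come from running ctags (for example through `subprocess`), with whichever option spelling the installed version accepts.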
bta
  • Agreed. This question is well treated in this [question](http://stackoverflow.com/questions/1570917/extracting-c-c-function-prototypes) – Zeugma Apr 25 '12 at 23:48
  • Having done six months of static analysis of C in Python for an internship, I couldn't agree more with the author's sentiment about the problem. Otherwise, you will likely go down the road of parsing and syntax trees, or devise a kludgy regex solution that misses some weird edge case. Go with ctags and don't reinvent the wheel. – mvanveen Apr 26 '12 at 00:29
  • This is clever -- get the best of another tool that already does a really good job at parsing C. Nice. – sarnold Apr 26 '12 at 01:01
  • I'm using the latest version of ctags and I'm getting this error: `ctags: unrecognized option '--c-types=f' Try `ctags --help' for a complete list of options.` – Soubriquet Aug 16 '16 at 20:41
  • @Soubriquet I believe newer versions may have changed the word "types" to "kinds" in the command-line options, but the syntax is otherwise the same. – bta Aug 17 '16 at 20:04
0

Here is one of my scripts to remove function bodies from a C source file. The only requirement is ctags installed with brew on Mac OS X, not the ctags built into Mac OS X; I am not sure why it does not work with the built-in ctags. You can install ctags using brew with this command:

$ brew install ctags

Then, run the following Perl script, named dummyc.pl, on a C source file. For example, given this input C source:

int
func1 (int para)
{
  return para;
}

int
func2 (int para)
{
  if (1)
    {
      return para;
    }
  return para;
}

This is the output:

int
func1 (int para)
{
  return 0;
}

int
func2 (int para)
{
  return 0;
}

This is the Perl script:

#!/usr/bin/env perl
use strict;
use warnings;

unless ( @ARGV == 1 )
{
  print "Filter out the body of C functions.
Usage: dummyc.pl file.c
Required: ctags (e.g., \$ brew install ctags)\n";
  exit;
}

my $cfile = $ARGV[0];
my $lc = 1;
my $kindPrev = "comment";
my $lnPrev = 1;
my $lsPrev = "comment";
my $namePrev = "comment";
my $line = 1;
open(CFILE, $cfile) or die "could not open $cfile: $!";
open(PIPE, "/usr/local/Cellar/ctags/5.8/bin/ctags -xu $cfile|") or die "couldn't start pipe: $!";
while ($line)
{
  last unless $line;
  # R_USE_SIGNALS    macro        24 errors.c         #define R_USE_SIGNALS 1
  $line = <PIPE>;
  my $name;
  my $kind;
  my $ln;
  my $ls;
  if ($line)
  {
    $line =~ /^(\S+)\s+(\w+)\s+(\d+)\s+$cfile\s+(.+)/;
    $name = $1;
    $kind = $2;
    $ln = $3;
    $ls = $4;
  }
  else
  {
    $ln = 1000000;
  }

  if ($kindPrev eq "function") 
  {
    my $isFunctionBody = 0;
    my $hasStartBrace = 0;
    my $hasReturnValue = 1;
    my $noReturn = 0;
    for (my $i = $lnPrev; $i < $ln; $i++)
    {
      my $cline = <CFILE>;
      last unless $cline;

      if ($cline =~ /void.+$namePrev/)
      {
        $hasReturnValue = 0;  
      }
      if ($cline =~ /NORET.+$namePrev/)
      {
        $noReturn = 1;  
      }
      if ($isFunctionBody == 0 and $cline =~ /\{/)
      {
        $isFunctionBody = 1;
        unless ($cline =~ /^\{/)
        {
          $hasStartBrace = 1;
          print $cline;
        }
      }
      elsif ($cline =~ /^\}/)
      {
        $isFunctionBody = 0;
        print "{\n" if $hasStartBrace == 0;
        if ($noReturn == 0)
        {
          if ($hasReturnValue == 1)
          {
            print "  return 0;\n";
          }
          else
          {
            print "  return;\n";
          }
        }
      }
      unless ($isFunctionBody == 1)
      {
        print $cline;
      }
    }
  }
  else
  {
    for (my $i = $lnPrev; $i < $ln; $i++)
    {
      my $cline = <CFILE>;
      last unless $cline;
      print $cline;
    }
  }
  $kindPrev = $kind;
  $lnPrev = $ln;
  $lsPrev = $ls;
  $namePrev = $name;
}
close(PIPE) or die "couldn't close pipe: $! $?";
close(CFILE) or die "couldn't close $cfile: $! $?";

You might want to edit the Perl script, though; for example, the path to ctags (/usr/local/Cellar/ctags/5.8/bin/ctags) is hard-coded.

Sangcheol Choi
0

Here is a pure Python solution that is very simple to implement.

Function extracting the body

Basically, you try to match each { with a corresponding }:

  • If another { comes before the next }, you are entering a nested scope.
  • On the other hand, if a } comes before the next {, you are exiting the current scope.

The implementation is then trivial:

  • you collect the indexes of all { and all } and keep them in two separate lists
  • you also maintain a scope depth variable
    • if the current { position is below the current } position, you are entering a scope: you add 1 to the scope depth and move to the next { position
    • if the current { position is above the current } position, you are exiting a scope: you subtract 1 from the scope depth and move to the next } position
  • when the scope depth variable reaches 0, you have found the closing brace of the function body

Suppose you have the string starting right after the first brace of your function body (brace excluded). Calling the following function on this substring gives you the position of the closing brace of the body:

import re

def find_ending_brace(string_from_first_brace):
  """Return the index of the '}' that closes the scope opened just before the string."""
  starts = [m.start() for m in re.finditer(r'\{', string_from_first_brace)]
  ends = [m.start() for m in re.finditer(r'\}', string_from_first_brace)]

  i = 0
  j = 0
  current_scope_depth = 1  # the opening brace of the body is already consumed

  while j < len(ends):
    if i < len(starts) and starts[i] < ends[j]:
      # the next brace is an opening one: entering a nested scope
      current_scope_depth += 1
      i += 1
    else:
      # the next brace is a closing one (or no { is left): exiting a scope
      current_scope_depth -= 1
      if current_scope_depth == 0:
        return ends[j]
      j += 1

  raise ValueError("no matching closing brace found")

Extracting candidate function definition

Now, if the original string of your file is in the variable my_content,

find_func_begins = [m for m in re.finditer(r"\w+\s+(\w+)\s*\((.*?)\)\s*\{", my_content)]

will give you the prototypes of each function (find_func_begins[0].group(1) == 'func1' and find_func_begins[0].group(2) == 'int para'), and

my_content[
  find_func_begins[0].start():
    find_func_begins[0].end() +
    find_ending_brace(my_content[find_func_begins[0].end():])]

will give you the whole definition of the first function, from its return type up to (but not including) the closing brace of its body.

Extracting the prototypes

Since the regex for find_func_begins is a bit loose (it also matches constructs inside a body, such as else if (...) {), you should search for the next function definition only after the closing brace of the previous one has been reached. Iterating over each function definition and matching braces yields the following iterative algorithm:

import re

reg_ex = r"\w+\s+(\w+)\s*\((.*?)\)\s*\{"
last = 0  # index just past the closing brace of the previous function
protos = ""
function_begin = re.search(reg_ex, my_content[last:], re.DOTALL)
while function_begin is not None:
  # absolute index just past the opening brace of the body
  function_proto_end = last + function_begin.end()
  # everything up to (but excluding) the opening brace is the prototype
  protos += my_content[last: function_proto_end-1].strip() + ";\n\n"

  # skip the body, then search again after its closing brace
  last = function_proto_end + find_ending_brace(my_content[function_proto_end:]) + 1
  function_begin = re.search(reg_ex, my_content[last:], re.DOTALL)

You should have what you want in protos. Hope this helps!
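
For comparison, the counter idea from the question (counter++ on {, counter-- on }) can also be sketched as a single pass over the characters, without regular expressions. This is a minimal sketch, not a C parser: `strip_function_bodies` is a made-up name, braces inside strings or comments are not handled, and any top-level { ... } block (a struct definition or an initializer, for instance) is treated like a function body.

```python
def strip_function_bodies(source):
    """Replace every top-level { ... } block with ';' using a depth counter."""
    out = []
    depth = 0
    for ch in source:
        if ch == '{':
            if depth == 0:
                # drop the whitespace before the body and end the prototype
                while out and out[-1].isspace():
                    out.pop()
                out.append(';')
            depth += 1
        elif ch == '}':
            depth -= 1
        elif depth == 0:
            # only text outside of any braces is kept
            out.append(ch)
    return ''.join(out)


my_content = """int func1 (int para) {
  return para;
}

int func2 (int para) {
  if (1) {
    return para;
  }
  return para;
}
"""

print(strip_function_bodies(my_content))
```

On the sample file from the question this leaves the two prototypes, each terminated by a semicolon.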

Raffi
0

I need to clean it up :)

import re
from linecache import getline


def get_line_count(filepath):
    # Helper that was not shown in the original answer;
    # assumed here to return the number of lines in the file.
    with open(filepath) as f:
        return sum(1 for _ in f)


class FuncBody(object):

    def __init__(self):
        self.stack = []

    def stack_insert(self, sym_list):
        # Process the braces of one line from left to right.
        sym_list.sort(key=lambda x: x[1])
        for sym, idx in sym_list:
            if self.stack and self.stack[-1] == '{' and sym == '}':
                # a closing brace cancels the innermost open one
                self.stack.pop()
            else:
                self.stack.append(sym)

    def braces(self, line):
        # (symbol, column) pairs for every brace on the line
        return [(m.group(), m.start()) for m in re.finditer(r'[{}]', line)]

    def get_body(self, filepath, start):
        # Yield the lines of the function that starts at line `start` (1-based).
        begin = False
        self.stack = []
        for lineno in range(start, get_line_count(filepath) + 1):
            line = getline(filepath, lineno)
            if not begin:
                if '{' in line:
                    # first opening brace: the body starts here
                    self.stack_insert(self.braces(line))
                    begin = True
                yield line
                continue
            if not self.stack:
                # every brace has been closed: end of the function
                break
            self.stack_insert(self.braces(line))
            yield line
Ankit Gupta