I need to extract function blocks (the full definition, not just the declaration) in order to build a function dependency graph. From that graph, I want to identify connected components and modularize my insanely huge C codebase, one file at a time.

Problem: I need a C parser that identifies function blocks and nothing more. We have custom types etc., but every signature follows this pattern:

storage_class return_type function_name ( comma separated type value pairs )
{

//some content I view as generic stuff

}

Solution that I've come up with: use sly and pycparser like any sane person would do, obviously.

Problem with pycparser: it needs the preprocessor directives pulled in from other files to be resolved, just to identify the code blocks. In my case, that nesting goes 6 levels deep. I am sorry I can't show the actual code.

Attempted Code with Sly :

from sly import Lexer, Parser
import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " " # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

class CLexer(Lexer):
    ignore = ' \t\n'
    tokens = {LEXEME, PREPROP, FUNC_DECL,FUNC_DEF,LBRACE,RBRACE, SYMBOL}
    literals = {'(', ')',',','\n','<','>','-',';','&','*','=','!'}
    LBRACE = r'\{'
    RBRACE = r'\}'
    FUNC_DECL = r'[a-z]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)[ ]*\;'
    FUNC_DEF = r'[a-zA-Z_0-9]+[ \n\t]+[a-zA-Z_0-9]+[ \n\t]*\([a-zA-Z_\* \,\t\n]+\)'
    PREPROP = r'#[a-zA-Z_][a-zA-Z0-9_\" .\<\>\/\(\)\-\+]*'
    LEXEME = r'[a-zA-Z0-9]+'
    SYMBOL = r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]'


    def __init__(self):
        self.nesting_level = 0
        self.lineno = 0

    @_(r'\n+')
    def newline(self, t):
        self.lineno += t.value.count('\n')

    @_(r'[-!$%^&*\(\)_+|~=`\[\]\:\"\;\'\<\>\?\,\.\/]')
    def symbol(self,t):
        t.type = 'symbol'
        return t

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

class CParser(Parser):
    # Get the token list from the lexer (required)
    tokens = CLexer.tokens

    @_('PREPROP')
    def expr(self,p):
        return p.PREPROP

    @_('FUNC_DECL')
    def expr(self,p):
        return p.FUNC_DECL

    @_('func')
    def expr(self,p):
        return p.func

    # Grammar rules and actions
    @_('FUNC_DEF LBRACE stmt RBRACE')
    def func(self, p):
        return p.func_def + p.lbrace + p.stmt + p.rbrace

    @_('LEXEME stmt')
    def stmt(self, p):
        return p.LEXEME

    @_('SYMBOL stmt')
    def stmt(self, p):
        return p.SYMBOL

    @_('empty')
    def stmt(self, p):
        return p.empty

    @_('')
    def empty(self, p):
        pass

with open('inputfile.c') as f:
    data = "".join(f.readlines())
    data = comment_remover(data)
    lexer = CLexer()
    parser = CParser()
    while True:
        try:
            result = parser.parse(lexer.tokenize(data))
            print(result)
        except EOFError:
            break

Error :

None
None
None
.
.
.
.
None
None
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
yacc: Syntax error at line 1, token=LBRACE
yacc: Syntax error at line 1, token=PREPROP
.
.
.
.
.

INPUT:

#include <mycustomheader1.h> //defines type T1
#include <somedir/mycustomheader2.h> //defines type T2
#include <someotherdir/somefile.c>

MACRO_THINGY_DEFINED_IN_SOMEFILE(M1,M2) 

static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{

 //some code I don't even care about at this point

}

extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{

 //some code I don't even care about at this point

}

DESIRED OUTPUT:


function1 : 

static T1 function_name_thats_way_too_long_than_usual(int *a, float* b, T2* c)
{

 //some code I don't even care about at this point

}

function2 :

extern T2 function_name_thats_way_too_long_than_usual(int *a, char* b, T1* c)
{

 //some code I don't even care about at this point

}


LazyCoder
  • If your problem is preprocessing, have you considered passing the input through a preprocessor before parsing it? – rici Jul 27 '19 at 22:59
  • No, and I think that's beside the point. A function in C follows a pattern, and I need a pushdown automaton to recognize that sequence. `gcc -E` might speed up the symbol-recognition part, but I still need a PDA to recognize function blocks. Right? – LazyCoder Jul 28 '19 at 07:24

1 Answer

pycparser has a func_defs example to do exactly what you need, but IIUC you're having issues with the preprocessing?

This post describes in some detail why pycparser needs preprocessed files, and how to set it up. If you control the build system it's actually pretty easy. Once you have preprocessed files, the example mentioned above should work.

I will also note that statically finding function dependencies is not an easy problem, because of function pointers. You also won't be able to do this accurately with a single file - this needs multi-file analysis.
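To illustrate the multi-file point, here is a minimal sketch of merging per-file call lists and grouping functions into connected components. It assumes the caller-to-callee lists have already been extracted by some other means; the input shape (`{filename: {caller: [callees]}}`) and both function names are hypothetical, and calls through function pointers are simply not represented.

```python
from collections import deque


def build_undirected_graph(per_file_calls):
    """Merge {filename: {caller: [callees]}} into one undirected adjacency map.

    Only functions defined somewhere in the input become nodes, so shared
    library calls (e.g. printf) do not glue unrelated functions together.
    """
    defined = {name for calls in per_file_calls.values() for name in calls}
    graph = {name: set() for name in defined}
    for calls in per_file_calls.values():
        for caller, callees in calls.items():
            for callee in callees:
                if callee in defined and callee != caller:
                    graph[caller].add(callee)
                    graph[callee].add(caller)
    return graph


def connected_components(graph):
    """Return the components as sorted lists, found by breadth-first search."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components
```

Each resulting component is a candidate module: no function in it calls, or is called by, a function in another component, at least as far as direct calls go.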

Eli Bendersky
  • Does it show only function definitions, or function code blocks as well? The first I can achieve with a regex, as shown in the problem. The latter is what I need: the function definition returned along with the function body. – LazyCoder Jul 28 '19 at 13:39
  • @LazyCoder: it shows where the function definitions are and gives you their AST nodes (this is *way* more robust than using regexes). Once you have nodes you can use the C generator (https://github.com/eliben/pycparser/blob/master/examples/c-to-c.py) to emit them back to C and get your definitions. But my impression was you want to analyze the AST of those functions to find calls to other functions rather than print the body back out – Eli Bendersky Jul 28 '19 at 13:41
  • Yep, I need to get the ASTs of functions and find their calls to other functions. The function pointer case is rare in my code. Combining the comments: I need to preprocess first, then use pycparser to give me ASTs, and then find calls to other functions from those ASTs? Is there a simpler way? – LazyCoder Jul 28 '19 at 14:16
  • @LazyCoder: pycparser will give you the AST for each function definition. You can analyze this AST to find calls. There's another example in the `examples` directory which finds *all calls*, so you should be able to combine these examples to a working solution fairly easily once you have the preprocessed code. – Eli Bendersky Jul 29 '19 at 03:28
  • Holy ships at the harbour! You are the author of pycparser? Sorry for being hard on your work dude. I just didn't want to throw cannons at sparrows. I'll see what I can do with your code. – LazyCoder Jul 29 '19 at 05:42