Antlr grammar for parsing C source code files and getting functions from them

Question

I wrote an Antlr grammar for parsing functions from C source code files:

grammar newCfunctions;

options
{
    language = CSharp;
}
@parser::namespace { Generated }
@lexer::namespace  { Generated }

func
    :function+ { Console.WriteLine("hello"); } //this is for debugging
    ;
NAME
    :[a-zA-Z]+[a-zA-Z0-9]*
    ;
TYPENAME
    :   'void'
    |   [a-zA-Z]+
    |   'char'
    |   'short'
    |   'int'
    |   'long'
    |   'float'
    |   'double'
    |   'signed'
    |   'unsigned'
    |   '_Bool'
    |   '_Complex'
    |   '__m128'
    |   '__m128d'
    |   '__m128i'
    |   NAME
    ;
arguments
    :   (TYPENAME NAME)*
    ;
Newline
    :   '\r'? '\n' ;
FUNCTIONBODY
    :   ([a-zA-Z0-9]|Newline)*;
function 
    :   TYPENAME ' ' NAME '(' arguments ')' ' '? Newline? '{' FUNCTIONBODY '}' Newline?
    ;

I generated the C# files and included them in a test project. Its Main function:

            try
            {
                AntlrInputStream input = new AntlrInputStream(Console.In);
                newCfunctionsLexer lexer = new newCfunctionsLexer(input);
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                newCfunctionsParser parser = new newCfunctionsParser(tokens);
                parser.func();
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
            }
            Console.ReadKey();

When I enter "void foo(int a){return a;}", it gives me an error: "line 1:0 mismatched input 'void' expecting TYPENAME". Please help me with this grammar! I saw a C grammar on the Internet, but it has 800+ lines and I don't know what to do with it. If you know how to use it, please tell me. Thank you!

If you want to really parse C source code, you need an accurate grammar, a preprocessor and some kind of symbol table. You will find the effort to put this together a lot more than you might expect. (See an example C parse: http://stackoverflow.com/questions/2143552/recommend-c-front-end-that-preserves-preprocessor-directives/2173477#2173477). If you don't care if your parse is precise and detailed, you can design a sloppy grammar that may read any valid C chunk, but that grammar has to generalize a precise one accurately or you'll get errors like the one you have. — Ira Baxter, May 28 '16 at 16:09
I'm pretty sure the ANTLR site has a much better C grammar, and no, I wouldn't be surprised if it was 800 lines. C is NOT a simple language in spite of what you may think. — Ira Baxter, May 28 '16 at 16:18
@Ira Baxter I know that the C language is not simple. But my aim is not to parse the whole C language; I want to parse only function blocks. The C grammar from the ANTLR site has a lot of stuff, and I don't need that much. I can't take parts of it, because they depend on each other. — Bodryi, May 28 '16 at 16:33
To parse a "function" (block? not a defined term in C to my knowledge), you need most of the language. Maybe you don't mean "parse" in the usual sense of the word. For most of us, the narrow computer-science interpretation of "parse" is "extract the structure and detail", at which point you cannot avoid using the knowledge in a grammar (if not using a grammar directly). — Ira Baxter, May 28 '16 at 16:54
@IraBaxter I have source code files and I need to get functions ("blocks" like this: void foo(*arguments*){*do smth*} ) from these files. I've changed the title for better understanding. — Bodryi, May 28 '16 at 17:57
The grammar you have here won't work because of how lexer rules are handled. `void` is matched to `NAME`, since it appears first in the grammar, but if you put `TYPENAME` first, then you'll get no `NAME`, since `TYPENAME` includes `NAME`. A simple grammar like that won't do if you need precise parsing (Ira is right). If you still want to hack around, you'll need to read the ANTLR book to understand how it works. — Lucas Trzesniewski, May 28 '16 at 18:14
If all @Bodryi wants to do is pick up function *headers* with a *blob* of text for the body, he can write a grammar sort of like this. Somehow, to match the function body, he will have to match '{' ... '}' pairs, which means his lexer must pick out those tokens unerringly ("// abc } def " isn't one), which means his lexer must know enough of the language to pick out those tokens. Then of course he has to get the top level of it right. But I really can't imagine what one can do with a function header/function-body-blob parser. Maybe OP will tell us why he wants that, that it isn't enough. — Ira Baxter, May 28 '16 at 20:46

score 0 · Accepted Answer · answered May 28 '16 at 18:39

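As has been said, the NAME rule should be placed after the TYPENAME rule: when two lexer rules match the same text, ANTLR picks the one defined first. Moreover, the TYPENAME token should not contain the NAME token or [a-zA-Z]+, otherwise every identifier would be lexed as TYPENAME.

So, the final version: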

grammar newCfunctions;

options
{
    language = CSharp;
}
@parser::namespace { Generated }
@lexer::namespace  { Generated }

func
    : function+ { Console.WriteLine("hello"); } //this is for debugging
    ;
function 
    : typename ' ' NAME '(' arguments ')' ' '? Newline? '{' functionBody '}' Newline?
    ;
arguments
    : (typename NAME)*
    ;
typename
    : TYPENAME
    | NAME
    ;
functionBody
    : (TYPENAME | NAME | Newline)*
    ;
TYPENAME
    :   'void'
    |   'char'
    |   'short'
    |   'int'
    |   'long'
    |   'float'
    |   'double'
    |   'signed'
    |   'unsigned'
    |   '_Bool'
    |   '_Complex'
    |   '__m128'
    |   '__m128d'
    |   '__m128i'
    ;
NAME
    : [a-zA-Z]+ [a-zA-Z0-9]*
    ;
Newline
    :   '\r'? '\n' ;

I also advise sending newlines and spaces to a separate channel, so that the parser ignores them.
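
As a rough illustration of the channel advice, here is a minimal ANTLR 4 sketch. The grammar name `FunctionsSketch` and the `block`, `argument`, `WS` and `OTHER` rules are my own assumptions, not part of the grammar above. It also assumes underscores are allowed in names and that the body contains no braces inside comments or string literals, which is exactly the weak spot pointed out in the comments. With whitespace on the hidden channel, the `' '` and `Newline` references in `function` are no longer needed.

```antlr
grammar FunctionsSketch;   // hypothetical name, for illustration only

func
    : function+ EOF
    ;
function
    : typename NAME '(' arguments ')' block
    ;
arguments
    : (argument (',' argument)*)?
    ;
argument
    : typename NAME
    ;
typename
    : TYPENAME
    | NAME
    ;
// A body is a balanced pair of braces; every other token inside is
// consumed without being interpreted.
block
    : '{' (block | ~('{' | '}'))* '}'
    ;

// Keywords come before NAME so they win when both rules match the same text.
TYPENAME
    : 'void' | 'char' | 'short' | 'int' | 'long' | 'float' | 'double'
    ;
NAME
    : [a-zA-Z_] [a-zA-Z0-9_]*
    ;
// Spaces and newlines go to the hidden channel: the parser never sees them.
WS
    : [ \t\r\n]+ -> channel(HIDDEN)
    ;
// Catch-all so that characters such as ';' or '+' still produce a token.
OTHER
    : .
    ;
```

With this sketch, the input "void foo(int a){return a;}" should lex as TYPENAME, NAME, '(', TYPENAME, NAME, ')', '{', NAME, NAME, OTHER, '}' and be accepted by `func`. Using `-> skip` instead of `-> channel(HIDDEN)` would discard the whitespace entirely; the hidden channel keeps it available if the original text of a function has to be reproduced later.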
