5

The question title might be misleading when read out of context. Let me first explain what I am trying to build.

I am building a script which will take 100s of very simple C programs written by my students and check for some very basic properties such as.

  1. Have they declared a variable called 'x', is it's type 'int' and so on.
  2. what is the value of variable 'z' ?

If this was some kind of scripting programming language this could have been a lot easier. I could have simply used include or eval and then done the checks.

But when it comes to C programming I would say this is very tough. How can I do it ?

Eastern Monk
  • 6,395
  • 8
  • 46
  • 61
  • 2
    I wish my professors gave me unit-testing code to code against at that time! why don't you provide a simple unit-testing module where students should code against? – Khaled Alshaya Sep 08 '11 at 00:41
  • I wan to build a portal where students can come in their spare time, type out some code in the browser against the listed objective submit and they will gain points for each successful objective. – Eastern Monk Sep 08 '11 at 00:44
  • 1
    Well, you could tell them to get input from standard input, and provide output to standard output. This way, on the server side you can try to compile the code, if it compiles give input & receive output through standard streams. – Khaled Alshaya Sep 08 '11 at 00:47
  • How's your compiler writing skills? You can use these [Lex](http://www.lysator.liu.se/c/ANSI-C-grammar-l.html) and [Yacc](http://www.lysator.liu.se/c/ANSI-C-grammar-y.html) grammars to parse through the C code. Alternatively, use an IR like LLVM. Either way, just traverse the AST to verify your desired properties. – chrisaycock Sep 08 '11 at 01:15
  • 2
    @chisaycock: And if you have enough time in your life to build a full C parser, then you can get such answers. Most of us have finite lives and want to do something else. – Ira Baxter Sep 08 '11 at 04:09
  • @Ira, its not hard with antlr and pre done grammars – Keith Nicholas Sep 08 '11 at 04:21
  • @Keith: If you use a predefined grammar, preprocessor, and symbol table, you are right. My reaction was to "you can use Lex and Yacc to..." which is an invitation to huge amount of work. Regarding ANTLR: I believe there is a C parser for ANSI C; don't know about other dialects. I'm not sure there is any kind of symbol table constructed, so the answer still isn't easy: given the symbol x, how would you know, with the ANTLR parser, that x is declared as int? – Ira Baxter Sep 08 '11 at 04:26
  • @Ira putting a symbol table is trivial, I think it took me 10 minutes . so things like is there an x, and what type is x ok. What the value of something is? whole new ballgame :) – Keith Nicholas Sep 08 '11 at 04:44
  • @Keith: for all of C? Macro definitions, local scopes, struct contents, parameter lists, external functions, special properties of various dialects (DLLs? Interrupt functions? embedded assembler register names)? 10 minutes? You're a better man than I. – Ira Baxter Sep 08 '11 at 04:46
  • @Ira, no, but a symbol table for the things he needs in this case I think is trivial (ish). – Keith Nicholas Sep 08 '11 at 05:01
  • @Keith: Sure, if his (student's) programs are *always* trivial he can use trivial symbol tables. If his students are any good, they'll write a program that uses a non-trivial feature and he'll get a black eye misdiagnosing it. Simple solutions statistically work well. They don't work well on specific instances. Gentlemen, place your bets. – Ira Baxter Sep 08 '11 at 05:05
  • @Ira, its based on his rules, if he is trying to ask questions like "is there a variable x" and "is x an int" then it doesn't matter what the students write, capturing just scoped variables and parameters would answer those kinds of questions. It's likely anything beyond a trivial snippet of C code is a fail :) – Keith Nicholas Sep 08 '11 at 05:09
  • @Keith: we're in full agreement. If his problem is always trivial, he can use trivial solutions. Place your bets. Sorry, my experience is that whenever you do this, you offend people that build real code. – Ira Baxter Sep 08 '11 at 05:13
  • Yes, my students are going to write trivial problems. The complex programs probably will be a program to check if a string is a palindrome. – Eastern Monk Sep 08 '11 at 23:02

4 Answers4

2

You can use ANTLR to do this. There is already a C grammar you can use with ANTLR so mainly all you have to do is pass the code into antlr then walk the syntax tree looking for various attributes....

you can use ANTLR from a number of languages. While it might seem daunting at first. It's actually surprisingly easy to work with.

Keith Nicholas
  • 43,549
  • 15
  • 93
  • 156
1

I saw an initiative to make an XML dumper for the tree built by GCC, called gcc-xml. So you give it a file example1.cxx like:

struct EmptyClass {};

int a_function(float f, EmptyClass e)
{
}

int main(void)
{
    return 0;
}

You'll get back:

<?xml version="1.0"?>
<GCC_XML>
  <Namespace id="_1" name="::" members="_2 _3 _4 "/>
  <Function id="_2" name="main" returns="_5" context="_1" location="f0:8"/>
  <Function id="_3" name="a_function" returns="_5" context="_1" location="f0:4">
    <Argument name="f" type="_6"/>
    <Argument name="e" type="_4"/>
  </Function>
  <Struct id="_4" name="EmptyClass" context="_1" location="f0:1" members="_7 _8 " bases=""/>
  <FundamentalType id="_5" name="int"/>
  <FundamentalType id="_6" name="float"/>
  <Constructor id="_7" name="EmptyClass" context="_4" location="f0:1">
    <Argument name="_ctor_arg" type="_9"/>
  </Constructor>
  <Constructor id="_8" name="EmptyClass" context="_4" location="f0:1"/>
  <ReferenceType id="_9" type="_4c"/>
  <File id="f0" name="example1.cxx"/>
</GCC_XML>

Caveats would be that it only works with the subset of C compatible with C++, and the official project doesn't support the dumping of function bodies. I don't know how much progress the unofficial efforts have made:

http://www.djlauk.de/index.php/Projects/GccXmlFunctionBodies

  • After writing my answer, I found this here on StackOverflow...which mentions (among other things) Dehydra, clang, and Elsa: http://stackoverflow.com/questions/1444961/is-there-a-good-python-library-that-can-parse-c – HostileFork says dont trust SE Sep 08 '11 at 01:27
0

An important distinction is what you mean by "checking" these programs. I assume you don't mean actually compiling/running them during runtime (which is what reflection means to me), because you can't check their program structure from a compiled program.

That means you're going to be text processing the source files, and while this is much easier in pretty much every scripting language, it's not that bad in C:

  1. Figure out how to get a list of all the files to be processed. Either put all the file names in an index file, and iterate over that file, OR get a list of files by using readdir or preferably a library (Boost) to do it for you.

  2. For each file, open it and read each line (Google, it's trivial)

  3. For each line, check it against your rules, and collect the necessary results.

  4. Store result in an array, or write it to a file, etc.

Edit-If you want to check to see if your students' programs run (aka, compile/run them during runtime), you'd probably have to execve off some gcc calls (if you want to compile), and then again to run the program. However, execve is only going to tell you if the command failed. Getting output from other programs would mean opening pipes with popen. If you find yourself ready to do this, turn back, you've gone too far.

prelic
  • 4,450
  • 4
  • 36
  • 46
0

Reflection from inside the language is always limited by the language designers (the designers of C simply don't allow any). If you use a tool that can inspect any part of the language from outside the language, you don't have the problem of being limited by what the language designers imagined. (All the current languages with reflection limit what what you can learn by writing code in the language. This seems to me to be a stupid constraint.)

The way to analyze a program is to an agent the understands the programming language from outside, and doesn't have any such limitations. See our DMS Software Reengineering Toolkit for a system that can provide the ultimate in program inspection. Using its C Front End you can arguably ask any well-formulated question about the C source code in a reliable way.

The specific questions you ask could be answered easily by DMS and its C front end (the one about the value of Z might be hard; you're reasoning about a Turing machine).

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341