10

A C program source code can be parsed according to the C grammar(described in CFG) and eventually turned into many ASTs. I am considering if such tool exists: it can do the reverse thing by firstly randomly generating many ASTs, which include tokens that don't have the concrete string values, just the types of the tokens, according to the CFG, then generating the concrete tokens according to the tokens' definitions in the regular expression.

I can imagine the first step looks like an iterative non-terminals replacement, which is randomly and can be limited by certain number of iteration times. The second step is just generating randomly strings according to regular expressions.

Is there any tool that can do this?

Wooble
  • 87,717
  • 12
  • 108
  • 131
W.Sun
  • 868
  • 2
  • 11
  • 19
  • But.... why? :p I suppose you could - if you follow the CFG specification you are guaranteed to come up with at-least-syntactically valid C code. But I don't think anybody's ever tried to do such a thing - you'd probably need to write some sort of reverse parser... – Robert Dec 17 '10 at 05:56
  • I've worked with a few ... (this is a joke; I love all my colleages). – Noon Silk Dec 17 '10 at 05:56
  • 2
    is there any reason one would want random, nonsensical source code that would (most likely) do nothing? – TJ Ellis Dec 17 '10 at 05:57
  • 3
    Yes, eg. a random test case generator for compiler/interpreter, or just for fun. – W.Sun Dec 17 '10 at 06:14
  • 1
    Note that beyond fuzz testing, it becomes hard to maintain the correctness of the test cases to ensure contextual correctness (e.g. no repeated variable declarations in the same scope) or type correctness – grrussel Jan 05 '11 at 14:51

4 Answers4

5

The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.

In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.

grrussel
  • 7,209
  • 8
  • 51
  • 71
2

Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.

For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.

The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.

Community
  • 1
  • 1
Anderson Green
  • 30,230
  • 67
  • 195
  • 328
1

If you have a model of the grammar in a normalized form (all rules like this):

 LHS = RHS1 RHS2 ...  RHSn ;

and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.

Simply start with the goal symbol as a unit tree.

  Repeat until no nonterminals are left:
    Pick a nonterminal N in the tree;
       Expand by adding children for the right hand side of any rule
       whose left-hand side matches the nonterminal N

For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.

A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.

Then prettyprint the result. Its the prettyprinter part that is hard to get right.

[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].

A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
0

the problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar); you have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes, weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices

there is lot more on this subject in: Noam Chomsky, Marcel-Paul Sch\"{u}tzenberger, ``The Algebraic Theory of Context-Free Languages'', pp.\ 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963) (see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)