Good way to parse code -- written in a C-like or Lisp-like (or any) language -- into an array using C#?

Question

What would be a good way to parse C-like or Lisp-like code into an array using C#?

So for example, for a little snippet like the following:

if (number > 50) {
    alert('Hello, World!');
}

I want to be able to store every word and symbol into an array.

So far, though, I have only managed to produce an array like the following:

[0] if
[1] (number
[2] >
[3] 50)
[4] {
[5] alert('Hello,
[6] World!');
[7] }

See array location 1, where it says (number? That's not what I want. I want even that little parenthesis to be placed in its own array location.

My initial idea was to read the code one character at a time and store the characters into array entries accordingly. But that feels like reinventing the wheel for string parsing. Is there a simpler way of doing this?

p.s. I'm doing this because I want to learn proper string manipulation.

This is called tokenizing, and it's the first step of building a compiler. I'd research compiler tokenizing, there are probably dozens of examples out there for tokenizing C-like languages. — Chris Eberle, Jun 11 '11 at 15:46
This question has been asked many times before; most end up referring to http://stackoverflow.com/questions/1669/. Also, an array is not a good representation of syntax and, if you do decide to store your code in an array, strings are not good representations of tokens. First decide if you want to learn how to parse code or to manipulate strings, the two are different problems. — Dour High Arch, Jun 11 '11 at 16:47

Dialecticus · Accepted Answer · 2011-06-11T16:34:58.153

3

There are many rules for parsing the C language, and you can't tokenize the code simply by splitting on whitespace characters.

You need to have a notion of symbols. Tokens . , - + / * -> ( ) = == != < > <= >= << >> ; ? : " ' & && | || ~ (and so on) are all symbols. If you come across one of them while scanning, treat it as a separate token even when no whitespace separates it from the neighboring text, and prefer the longest match so that == is one token rather than two = tokens. After an opening " or ' this rule is suspended until you reach the matching closing quote; a quote that follows the escape character \ does not close the literal. There is also comment handling, trigraphs, macro handling, and many more things to be aware of.
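The rules above can be sketched as a small hand-written scanner in C#. This is only an illustration under some assumptions, not a real C lexer: the class name `SimpleTokenizer` and its symbol table are made up for the example, braces and `!` are added to the symbol list because the question's snippet uses braces, and comments, trigraphs and macros are ignored entirely.

```csharp
using System;
using System.Collections.Generic;

public static class SimpleTokenizer
{
    // Hypothetical symbol table. Multi-character symbols come first so that
    // the longest match wins ("==" is found before "=").
    private static readonly string[] Symbols =
    {
        "->", "==", "!=", "<=", ">=", "<<", ">>", "&&", "||",
        ".", ",", "-", "+", "/", "*", "(", ")", "{", "}", "=", "<", ">",
        ";", "?", ":", "&", "|", "~", "!"
    };

    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < source.Length)
        {
            char c = source[i];

            // Whitespace only separates tokens; it is never stored.
            if (char.IsWhiteSpace(c)) { i++; continue; }

            // Quoted literal: copy everything up to the matching quote,
            // skipping over any character that follows a backslash.
            if (c == '"' || c == '\'')
            {
                int start = i++;
                while (i < source.Length && source[i] != c)
                {
                    if (source[i] == '\\' && i + 1 < source.Length) i++;
                    i++;
                }
                if (i < source.Length) i++; // consume the closing quote
                tokens.Add(source.Substring(start, i - start));
                continue;
            }

            // Symbol: always its own token, whatever comes next.
            int symbolLength = SymbolLengthAt(source, i);
            if (symbolLength > 0)
            {
                tokens.Add(source.Substring(i, symbolLength));
                i += symbolLength;
                continue;
            }

            // Anything else is a word or number: read until whitespace,
            // a quote, or a symbol.
            int wordStart = i;
            while (i < source.Length
                   && !char.IsWhiteSpace(source[i])
                   && source[i] != '"' && source[i] != '\''
                   && SymbolLengthAt(source, i) == 0)
            {
                i++;
            }
            tokens.Add(source.Substring(wordStart, i - wordStart));
        }
        return tokens;
    }

    // Returns the length of the symbol starting at index, or 0 if none does.
    private static int SymbolLengthAt(string source, int index)
    {
        foreach (string s in Symbols)
        {
            if (string.CompareOrdinal(source, index, s, 0, s.Length) == 0)
                return s.Length;
        }
        return 0;
    }
}
```

Called on the snippet from the question, `SimpleTokenizer.Tokenize(code)` yields `if`, `(`, `number`, `>`, `50`, `)`, `{`, `alert`, `(`, `'Hello, World!'`, `)`, `;`, `}`. One known limitation of this sketch: because `.` is in the symbol table, a literal such as `3.14` comes out as three tokens.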

edited Jun 11 '11 at 16:34

answered Jun 11 '11 at 16:29

Dialecticus


The escape character is backslash (`'\\'`), not forward slash (`'/'`). – Ben Voigt Jun 11 '11 at 16:34

score 1 · Answer 2 · answered Jun 11 '11 at 16:34


Read about fslex and fsyacc. They might be a good starting point for learning about abstract syntax trees, lexers and parsers.

Also, F# lexers and parsers written with fslex and fsyacc are easy to use in a .NET application.

answered Jun 11 '11 at 16:34

Dmitry


I've been looking for a simple but not trivial example of an F# parser - I'm in the same boat as the poster, with a desire to get my feet wet with parsing. – Aaron Anodide Jun 11 '11 at 16:36

Or Antlr, which [also can create parsers written in C#](http://www.antlr.org/wiki/pages/viewpage.action?pageId=557075) – Ben Voigt Jun 11 '11 at 16:36
@Gabriel, 'Expert F#' by Don Syme has a simple but useful example. Unfortunately, the examples I've seen on the web don't cover everything: for example, I haven't seen one that uses %right/%left/%nonassoc or several lexer rules to parse comments. So I highly recommend Don Syme's book. – Dmitry Jun 11 '11 at 16:41

score 0 · Answer 3 · answered Jun 11 '11 at 16:33

You could try to set up the parser so that it first checks what kind of "something" the text at the current position is (a word, a number, a string, a symbol), and then tokenizes it accordingly.
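One way to read that idea, sketched in C# with `System.Text.RegularExpressions`: a single pattern with one named group per kind of token, so each match tells you which kind it is. The class name `RegexTokenizer`, the four kinds, and the patterns themselves are assumptions made for this example and cover only a small C-like subset, with no comment handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // One named group per token kind. Alternatives are tried left to right,
    // so strings are recognized before a lone quote could count as a symbol,
    // and two-character symbols before single characters.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<String>""(?:\\.|[^""\\])*""|'(?:\\.|[^'\\])*')" +
        @"|(?<Number>\d+(?:\.\d+)?)" +
        @"|(?<Word>[\p{L}_]\w*)" +
        @"|(?<Symbol>->|==|!=|<=|>=|<<|>>|&&|\|\||[^\s\w])");

    private static readonly string[] Kinds = { "String", "Number", "Word", "Symbol" };

    // Returns (kind, text) pairs in source order.
    public static List<KeyValuePair<string, string>> Tokenize(string source)
    {
        var tokens = new List<KeyValuePair<string, string>>();
        foreach (Match match in TokenPattern.Matches(source))
        {
            // Exactly one of the named groups succeeded for this match.
            foreach (string kind in Kinds)
            {
                if (match.Groups[kind].Success)
                {
                    tokens.Add(new KeyValuePair<string, string>(kind, match.Value));
                    break;
                }
            }
        }
        return tokens;
    }
}
```

For the snippet in the question this gives pairs such as `Word: if`, `Symbol: (`, `Word: number`, `Symbol: >`, `Number: 50`, and keeps `'Hello, World!'` together as one `String`. Note that `Regex.Matches` silently skips any text that matches none of the groups, so a real tokenizer would want to detect and report such gaps.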

For a book describing this very thing, take a look at "Structure and Interpretation of Computer Programs" (also known as SICP), which is available online and is used by many universities worldwide. The eval function it presents is a good example to use as a starting point.
