How to arrange the matched output in c#?

Question

i'm matching words to create simple lexical analyzer. here is my example code and output example code:

public class 
{
    public static void main (String args[]) 
    {
       System.out.println("Hello");
    }
}

output:

public = identifier
void = identifier
main = identifier
class = identifier

as you all can see my output is not arranged as the input comes. void and main comes after class but in output the class comes at the end. i want to print result as the input is matched.

c# code:

    private void button1_Click(object sender, EventArgs e)
    {
        if (richTextBox1.Text.Contains("public"))
            richTextBox2.AppendText("public = identifier\n");

        if (richTextBox1.Text.Contains("void"))
            richTextBox2.AppendText("void = identifier\n");

        if (richTextBox1.Text.Contains("class"))
            richTextBox2.AppendText("class = identifier\n");

        if (richTextBox1.Text.Contains("main"))
            richTextBox2.AppendText("main = identifier\n");

    }

You're not going to build a lexer using successive `string.Contains`. You'll have to walk over the text and give semantic meaning to tokens based on their context. — CodeCaster, Dec 11 '14 at 13:30
good beginning might be http://stackoverflow.com/questions/540593/lex-yacc-for-c — road242, Dec 11 '14 at 13:35

Pieter Witvoet · Answer 1 · 2014-12-11T15:02:06.780

Your code is asking the following qustions:

Does the input contain the text "public"? If so, write down "public = identifier".
Does the input contain the text "void"? If so, write down "void = identifier".
Does the input contain the text "class"? If so, write down "class = identifier".
Does the input contain the text "main"? If so, write down "main = identifier".

The answer to all of these questions is yes, and since they're executed in that exact order, the output you get should not be surprising. Note: public, void, class and main are keywords, not identifiers.

Splitting on whitespace?

So your approach is not going to help you tokenize that input. Something slightly more in the right direction would be input.Split() - that will cut up the input at whitespace boundaries and give you an array of strings. Still, there's a lot of whitespace entries in there.

input.Split(new char[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries) is a little better, giving us the following output: public, class, {, public, static, void, main, (String, args[]), {, System.out.println("Hello");, } and }.

But you'll notice that some of these strings contain multiple 'tokens': (String, args[]) and System.out.println("Hello");. And if you had a string with whitespace in it it would get split into multiple tokens. Apparently, just splitting on whitespace is not sufficient.

Tokenizing

At this point, you would start writing a loop that goes over every character in the input, checking if it's whitespace or a punctuation character (such as (, ), {, }, [, ], ., ;, and so on). Those characters should be treated as the end of the previous token, and punctuation characters should also be treated as a token of their own. Whitespace can be skipped.

You'll also have to take things like string literals and comments into account: anything in-between two double-quotes should not be tokenized, but be treated as part of a single 'string' token (including whitespace). Also, strings can contain escape sequences, such as \", that produce a single character (that double quote should not be treated as the end of the string, but as part of its content).

Anything that comes after two forward slashes should be ignored (or parsed as a single 'comment' token, if you want to process comments somehow), until the next newline (newline characters/sequences differ across operating systems). Anything after a /* should be ignored until you encounter a */ sequence.

Numbers can optionally start with a minus sign, can contain a dot (or start with a dot), a scientific notation part (e..), which can also be negative, and there are type suffixes...

In other words, you're writing a state machine, with different behaviour depending on what state you're in: 'string', 'comment', 'block comment', 'numeric literal', and so on.

Lexing

It's useful to assign a type to each token, either while tokenizing or as a separate step (lexing). public is a keyword, main is an identifier, 1234 is an integer literal, "Hello" is a string literal, and so on. This will help during the next step.

Parsing

You can now move on to parsing: turning a list of tokens into an abstract syntax tree (AST). At this point you can check if a list of tokens is actually valid code. You basically repeat the above step, but at a higher level.

For example, public, protected and private are keyword tokens, and they're all access modifiers. As soon as you encounter one of these, you know that either a class, a function, a field or a property definition must follow. If the next token is a while keyword, then you signal an error: public while is not a valid C# construct. If, however, the next token is a class keyword, then you know it's a class definition and you continue parsing.

So you've got a state machine once again, but this time you've got states like 'class definition', 'function definition', 'expression', 'binary expression', 'unary expression', 'statement', 'assignment statement', and so on.

Conclusion

This is by no means complete, but hopefully it'll give you a better idea of all the steps involved and how to approach this. There are also tools available that can generate parsing code from a grammar specification, which can ease the job somewhat (though you still need to learn how to write such grammars).

You may also want to read the C# language specification, specifically the part about its grammar and lexical structure. The spec can be downloaded for free from one of Microsofts websites.

Good explanation. +1 but there is one thing to mention: You can't simply remove white spaces while 'tokenizing' because some spaces may be part of literals. — B0Andrew, Dec 11 '14 at 14:41

score 0 · Answer 2 · answered Dec 11 '14 at 13:36

CodeCaster is right. You are not on the right path. I have an lexical analyzer made by me some time ago as a project.

I know, I know I'm not supposed to put things on a plate here, but the analyzer is for c++ so you'll have to change a few things.

Take a look at the source code and please try to understand how it works at least: C++ Lexical Analyzer

score 0 · Answer 3 · answered Dec 11 '14 at 13:37

In the strictest sense, the reason for the described behaviour is that in the evaluating code, the search for void comes before the search for class. However, the approach in total seems far too simple for a lexical analysis, as it simply checks for substrings. I totally second the comments above; depending on what you are trying to achieve in the big picture, a more sophisticated approach might be necessary.

How to arrange the matched output in c#?

3 Answers3