
I am trying to create a lexical analyzer in C. The program reads another program as input and converts it into tokens; the source code is below:

#include <stdio.h>
#include <conio.h>
#include <string.h>

int main()  {
    FILE *fp;
    char read[50];
    char seprators [] = "\n";
    char *p;
    fp=fopen("C:\\Sum.c", "r");

    clrscr();

    while ( fgets(read, sizeof(read)-1, fp) !=NULL )    {
        //Get the first token
        p=strtok(read, seprators);

        //Get and print other tokens
        while (p!=NULL) {
            printf("%s\n", p);
            p=strtok(NULL, seprators);
        }
    }

    return 0;
}

And the contents of Sum.c are:

#include <stdio.h>

int main()  {
    int x;
    int y;
    int sum;

    printf("Enter two numbers\n");
    scanf("%d%d", &x, &y);

    sum=x+y;

    printf("The sum of these numbers is %d", sum);

    return 0;
}

I am not getting the correct output; I only see a blank screen in place of the output.

Can anybody please tell me where I am going wrong? Thank you so much in advance.

  • Make *sure* your file opens successfully. You never check it, and often the current working directory at runtime isn't what you think, especially if you're running from an IDE or other such tool. Verify `fopen()` succeeded and if it didn't, `perror("Failed to open file.");` and exit. At least you'll know that was the issue. – WhozCraig Aug 23 '13 at 06:33
  • The choice of `"\n"` as the `seprators` value is wrong if you want to print individual tokens. – Grijesh Chauhan Aug 23 '13 at 06:35
  • If you really want to create a lexical analyzer, I highly recommend you use Flex (http://en.wikipedia.org/wiki/Flex_lexical_analyser). – Michael M. Aug 23 '13 at 06:45
  • @Michael: This is probably homework to work out _how_ a lexer works. In which case Flex is the worst way to learn that! – dave Aug 23 '13 at 11:45
  • @dave- You are right... This is a part of an academic assignment... – prateekmathur1991 Aug 23 '13 at 13:07

1 Answer


You've asked a few questions since this one, so I guess you've moved on. There are a few things worth noting about your problem and your start at a solution that can help others tackling a similar problem. You'll also find that people can often be slow to answer things that are obviously homework. We often wait until homework deadlines have passed. :-)

First, I noted you used a few features specific to the Borland C compiler which are non-standard and make the solution less portable. You could solve the problem without them just fine, and that is usually a good choice. For example, you used #include <conio.h> just to clear the screen with clrscr(), which is probably unnecessary and not relevant to the lexer problem.

I tested the program, and as written it works! It transcribes all the lines of the file Sum.c to stdout. If you only saw a blank screen, it is because the program could not find the file: either you did not put it in your C:\ directory or it has a different name. As already mentioned by @WhozCraig, you need to check that the file was found and opened properly.
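
A minimal sketch of that check, keeping your original hard-coded path, could look like this:

fp = fopen("C:\\Sum.c", "r");
if (fp == NULL) {
    /* Report why the open failed and stop, instead of passing a NULL FILE* to fgets */
    perror("Failed to open C:\\Sum.c");
    return 1;
}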

I see you are using the C function strtok to divide the input into tokens. There are some nice examples of its use in the documentation that go further than your simple case. As mentioned by @Grijesh Chauhan, there are more separators to consider than \n, the end-of-line character. What about spaces and tabs, for example?

However, in programs, things are not always separated by spaces and lines. Take this example:

result=(number*scale)+total;

If we only used white space as a separator, strtok would not identify the individual words; it would pick up the whole expression as one token, which is obviously not tokenization. We could add these symbols to the separator list:

char seprators [] = "\n=(*)+;";

Then your code would pick out those words too. There is still a flaw in that strategy, because in programming languages those symbols are themselves tokens that need to be identified, and strtok simply throws its separators away, as the sketch below shows. The problem with programming language tokenization is that there are no clear separators between tokens.
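
Here is a small, self-contained sketch of that limitation; it just tokenizes the example line above with the extended separator list:

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "result=(number*scale)+total;";
    char seprators[] = "\n=(*)+;";
    char *p;

    /* strtok returns the pieces between separators, but the separators
       themselves ('=', '(', '*', ')', '+', ';') are discarded */
    for (p = strtok(line, seprators); p != NULL; p = strtok(NULL, seprators)) {
        printf("%s\n", p);   /* prints: result, number, scale, total */
    }
    return 0;
}

The words come out, but the operators and punctuation, which a real lexer must also report, are gone.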

There is a lot of theory behind this, but basically we have to write down the patterns that form the basis of the tokens we want to recognise and not look at the gaps between them, because as has been shown, there aren't any! These patterns are normally written as regular expressions. Computer Science theory tells us that we can use finite state automata to match these regular expressions. Writing a lexer involves a particular style of coding, which looks like this:

while ( NOT <<EOF>> ) {
    switch ( next_symbol() ) {

        case state_symbol[1]:
            ....
            break;

        case state_symbol[2]:
            ....
            break;

        default:
            error(diagnostic);
    }
}
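
To make that style concrete, here is a rough sketch of a hand-written lexer that reads the file character by character instead of splitting on separators; the lex function and its three token categories (identifiers, numbers, and single-character symbols) are just made up for this illustration, not the structure your assignment requires:

#include <stdio.h>
#include <ctype.h>

/* Classify each character and accumulate identifiers and integer literals;
   everything else is emitted as a single-character symbol. */
void lex(FILE *fp) {
    int c = fgetc(fp);
    while (c != EOF) {
        if (isspace(c)) {                        /* skip white space between tokens */
            c = fgetc(fp);
        } else if (isalpha(c) || c == '_') {     /* identifier or keyword */
            printf("IDENT: ");
            while (c != EOF && (isalnum(c) || c == '_')) {
                putchar(c);
                c = fgetc(fp);
            }
            putchar('\n');
        } else if (isdigit(c)) {                 /* integer literal */
            printf("NUMBER: ");
            while (c != EOF && isdigit(c)) {
                putchar(c);
                c = fgetc(fp);
            }
            putchar('\n');
        } else {                                 /* operators, punctuation, etc. */
            printf("SYMBOL: %c\n", c);
            c = fgetc(fp);
        }
    }
}

int main(void) {
    FILE *fp = fopen("C:\\Sum.c", "r");
    if (fp == NULL) {
        perror("Failed to open C:\\Sum.c");
        return 1;
    }
    lex(fp);
    fclose(fp);
    return 0;
}

Run against your Sum.c, that would print one classified token per line, e.g. IDENT: int, IDENT: x, SYMBOL: ; and so on.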

So, now, perhaps the value of the academic assignment becomes clearer.

Brian Tompsett - 汤莱恩
  • Thank you very much for such an extensive answer... I literally had moved on, but this answer of yours made me understand the importance of lab assignments we did back in grad school.... – prateekmathur1991 May 02 '15 at 09:19