The easiest solution is to use a tool like flex to generate your lexer and let it do the work of breaking the input into tokens (although a flex scanner reads from a file stream by default; to scan a character array you have to hand it the buffer with yy_scan_string() or one of its relatives).
strtok() isn't a good solution for several reasons:
- It overwrites the input, which you may want to preserve for use later;
- It's a brute force tool and doesn't handle badly-formed input well;
- If you use your arithmetic operators as the token separators, then the operators themselves will get clobbered.
The usual solution is to write a state machine (which is basically what flex does for you). Here's a very quick-n-dirty (emphasis on the dirty) example:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>
/**
 * Read from a string specified by source, updating the pointer as we go.
 * We're assuming that token points to a buffer large enough to hold
 * our largest token; ideally, you would want to pass the length of the
 * target buffer and check against it, but I'm leaving it out for brevity.
 *
 * Tokens are either integers (strings of digits) or operators.
 *
 * Return 1 if we successfully read a token, 0 if we encountered an unexpected
 * character, and EOF if nothing but whitespace remains before the end of the
 * input string.
 */
int getToken(char **source, char *token)
{
  enum {START, DIGIT, ERR, DONE} state = START;
  size_t i = 0;
  const char *operators = "+-*/";

  if (**source == 0) // at end of input
    return EOF;

  while (**source != 0)
  {
    switch(state)
    {
      /**
       * Initial state for this call.
       */
      case START:
        if (isdigit((unsigned char) **source))
        {
          state = DIGIT;
          token[i++] = *(*source)++; // append the digit to the token
        }
        else if (strchr(operators, **source) != NULL)
        {
          state = DONE;
          token[i++] = *(*source)++; // add the operator to the token
          token[i++] = 0;            // and terminate the string
        }
        else if (isspace((unsigned char) **source))
        {
          (*source)++; // ignore whitespace
        }
        else
        {
          /**
           * We've read something that isn't a digit, operator, or
           * whitespace; treating it as an error for now.
           */
          state = ERR;
        }
        break;

      /**
       * We've read at least one digit.
       */
      case DIGIT:
        if (isdigit((unsigned char) **source))
        {
          token[i++] = *(*source)++; // append next digit to token
        }
        else
        {
          /**
           * We've read a non-digit character; terminate the token
           * and signal that we're done.
           */
          token[i++] = 0;
          state = DONE;
        }
        break;

      case DONE:
        return 1;

      case ERR:
        return 0;
    }
  }

  /**
   * We ran out of input inside the loop. If nothing was collected (only
   * whitespace remained), report end of input; if we were in the middle
   * of a number, terminate it before returning.
   */
  if (i == 0)
    return EOF;
  if (state == DIGIT)
    token[i] = 0;
  return 1;
}
int main(int argc, char **argv)
{
  char token[20];
  char *input;

  if (argc < 2) // nothing to tokenize
  {
    fprintf(stderr, "usage: %s expression\n", argv[0]);
    return EXIT_FAILURE;
  }

  input = argv[1];
  for (;;)
  {
    int result = getToken(&input, token);
    if (result == 1)
      printf("%s\n", token);
    else if (result == 0)
    {
      printf("Bad character '%c'; skipping\n", *input);
      input++;
    }
    else if (result == EOF)
    {
      printf("done\n");
      break;
    }
  }
  return 0;
}
Why (*source)++ instead of *source++ or source++? I don't want to update source itself; I want to update what source points to, so I have to dereference the pointer before the ++ is applied. Postfix ++ binds more tightly than unary *, so *source++ would parse as *(source++) and advance only the function's local copy of source. The expression *(*source)++ basically translates to "give me the value of the character that *source is pointing to, then advance *source to the next character".