Tokenizing a string in C?

Question

I'm working on a terminal parser for a calculator written in C. I cannot figure out how to concatenate all of the numbers that are in between operators to put them into an array.

For example, if the input (command line argument) was "4+342", it would ideally be input[] = {"4", "+", "342"}.

Here's my code so far. I'm including <stdio.h>, <stdlib.h>, and <ctype.h>.

typedef char * string;

int main(int argc, char *argv[])
{
  string inputS = argv[1];
  string input[10];
  string temp;
  printf("%s\n", inputS);
  int i;
  int len = strlen(inputS);
  printf("parsed:\n");
  for(i = 0; i < len; inputS++, i++)
  { 
    if(isdigit(*inputS))
    {
      printf("%c",*inputS);
    }
    else
    {
      printf("\n%c\n",*inputS);
    }
  }
  printf("\n");
  return 0;
}

If it is run with ./calc 4+5-546, it will output:

4+5-546
parsed:
4
+
5
-
546

So what's the easiest way to get each line of this into its own array slot?

There is **no** real string type in C, `char *` is definitely **not** one, and your typedef is **very dangerous** because it will encourage you to do some very wrong things. Pointers don't hold data in C; they point at it, and the memory must be allocated. If you want to call anything `string` in C, be prepared to do a lot of work to build up a half-decent abstraction. — Karl Knechtel, Dec 28 '10 at 17:37
@Karl Knechtel, thanks for explaining that. I'm just starting out with C, coming from Java, so I'm only beginning to understand how dangerous something like C is, whereas Java runs in a virtual machine and can't mess too much up. I'll keep the char * thing in mind next time. Is there a way you would recommend, though? Maybe set a fixed limit on the char array, [50] or so? Or should I just use malloc to get the space for it? Sorry if it's a stupid question; like I said, I'm still just a beginner. — Devan Buggay, Dec 29 '10 at 15:43
It depends on why you're learning C, tbh. It might very well be that, whatever your reason is for learning C, there's a better way to achieve your goal than by actually learning C. ;) — Karl Knechtel, Dec 29 '10 at 19:56

Jonathan Leffler · Accepted Answer · 2010-12-28T18:16:47.647

Try this for size...

#include <stdio.h>
#include <ctype.h>

typedef char * string;

int main(int argc, char *argv[])
{
    string inputS = argv[1];
    string input[50];   /* Up to 50 tokens */
    char   buffer[200];
    int    i;
    int    strnum = 0;
    char  *next = buffer;
    char   c;

    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s expression\n", argv[0]);
        return 1;
    }

    printf("input: <<%s>>\n", inputS);
    printf("parsing:\n");

    while ((c = *inputS++) != '\0')
    { 
        input[strnum++] = next;
        if (isdigit((unsigned char)c))   /* cast avoids undefined behaviour for negative char values */
        {
            printf("Digit: %c\n", c);
            *next++ = c;
            while (isdigit((unsigned char)*inputS))
            {
                c = *inputS++;
                printf("Digit: %c\n", c);
                *next++ = c;
            }
            *next++ = '\0';
        }
        else
        {
            printf("Non-digit: %c\n", c);
            *next++ = c;
            *next++ = '\0';
        }
    }

    printf("parsed:\n");
    for (i = 0; i < strnum; i++)
    {
        printf("%d: <<%s>>\n", i, input[i]);
    }

    return 0;
}

Given the program is called tokenizer and the command:

tokenizer '(3+2)*564/((3+4)*2)'

It gives me the output:

input: <<(3+2)*564/((3+4)*2)>>
parsing:
Non-digit: (
Digit: 3
Non-digit: +
Digit: 2
Non-digit: )
Non-digit: *
Digit: 5
Digit: 6
Digit: 4
Non-digit: /
Non-digit: (
Non-digit: (
Digit: 3
Non-digit: +
Digit: 4
Non-digit: )
Non-digit: *
Digit: 2
Non-digit: )
parsed:
0: <<(>>
1: <<3>>
2: <<+>>
3: <<2>>
4: <<)>>
5: <<*>>
6: <<564>>
7: <</>>
8: <<(>>
9: <<(>>
10: <<3>>
11: <<+>>
12: <<4>>
13: <<)>>
14: <<*>>
15: <<2>>
16: <<)>>

score 2 · Answer 2 · answered Dec 28 '10 at 18:05

The easiest solution is to use a tool like flex to generate your lexer and let it do the work of breaking the input into tokens (by default flex reads its input from a file stream rather than a character array, although yy_scan_string() lets it scan an in-memory string).

strtok() isn't a good solution for several reasons:

It overwrites the input, which you may want to preserve for use later;
It's a brute force tool and doesn't handle badly-formed input well;
If you use your arithmetic operators as the token separators, then the operators themselves will get clobbered.

The usual solution is to write a state machine (which is basically what flex does for you). Here's a very quick-n-dirty (emphasis on the dirty) example:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

/**
 * Read from a string specified by source, updating the pointer as we go.
 * We're assuming that token points to a buffer large enough to hold
 * our largest token; ideally, you would want to pass the length of the
 * target buffer and check against it, but I'm leaving it out for brevity.
 * 
 * Tokens are either integers (strings of digits) or operators. 
 *
 * Return 1 if we successfully read a token, 0 if we encountered an unexpected
 * character, and EOF if the next character is the end of the input string.
 */
int getToken(char **source, char *token)
{
  enum {START, DIGIT, ERR, DONE} state = START;
  size_t i = 0;
  const char *operators="+-*/";

  if (**source == 0)  // at end of input
    return EOF;

  while (**source != 0)
  {
    switch(state)
    {
      /**
       * Initial state for this call.
       */
      case START: 
        if (isdigit(**source))
        {
          state = DIGIT;
          token[i++] = *(*source)++; // append the digit to the token
        }
        else if (strchr(operators, **source) != NULL)
        {
          state = DONE;
          token[i++] = *(*source)++; // add the operator to the token
          token[i++] = 0;            // and terminate the string
        }
        else if (isspace(**source))
        {
          (*source)++;  // ignore whitespace
        }
        else
        {
          /**
           * We've read something that isn't a digit, operator, or 
           * whitespace; treating it as an error for now.
           */
          state = ERR;
        }
        break;

      /**
       * We've read at least one digit.
       */
      case DIGIT:
        if (isdigit(**source))
        {
          token[i++] = *(*source)++; // append next digit to token
        }
        else
        {
          /**
           * We've read a non-digit character; terminate the token
           * and signal that we're done. 
           */
          token[i++] = 0;
          state = DONE;
        }
        break;

      case DONE:
        return 1;
        break;

      case ERR:
        return 0;
        break;
    }
  }
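  /**
   * The input ended before the state machine finished the token: terminate
   * whatever we collected, or report EOF if we only skipped whitespace.
   */
  token[i] = 0;
  return i > 0 ? 1 : EOF;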
}

int main(int argc, char **argv)
{
  char token[20];
  char *input;

  if (argc != 2)
  {
    fprintf(stderr, "Usage: %s expression\n", argv[0]);
    return 1;
  }
  input = argv[1];
  for (;;)
  {
    int result = getToken(&input, token);
    if (result == 1)
      printf("%s\n", token);
    else if (result == 0)
    {
      printf("Bad character '%c'; skipping\n", *input);
      input++;
    }
    else if (result == EOF)
    {
      printf("done\n");
      break;
    }
  }
  return 0;
}

Why (*source)++ instead of *source++ or source++? I don't want to update source, I want to update what source points to, so I have to dereference the pointer before the ++ is applied. The expression *(*source)++ basically translates to "give me the value of the character that the expression *source is pointing to, then update the value of *source".

klefevre · Answer 3 · 2010-12-28T16:59:29.900

See the strcat man page (man strcat):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, const char **argv)
{
    char *toto_str = "Toto";
    char *is_str = "Is";
    char *awesome_str = "Awesome";
    char *final_str;
    size_t i;

    i = strlen(toto_str);
    i += strlen(is_str);
    i += strlen(awesome_str);

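    final_str = malloc((i * sizeof(char)) + 1);
    if (final_str == NULL)
        return 1;
    final_str[0] = '\0';    /* malloc() does not zero memory; strcat() needs an empty string to append to */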
    strcat(final_str, toto_str);
    strcat(final_str, is_str);
    strcat(final_str, awesome_str);

    printf("%s", final_str);
    free(final_str);

    return 0;
}

edited Dec 28 '10 at 16:59

answered Dec 28 '10 at 16:54

klefevre


You have to ensure that `final_str[0] = '\0';` before launching into your sequence of `strcat()` operations - because `malloc()` does not guarantee to return you initialized (zeroed) data. – Jonathan Leffler Dec 28 '10 at 17:06
That's how to concatenate, but OP wrote the question badly and isn't actually asking anything about concatenation. – Karl Knechtel Dec 28 '10 at 17:43

score 1 · Answer 4 · answered Jan 01 '11 at 21:16

strsep is a good choice here - grab the token and then decide what you want to do with it...

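char buf[] = "(3+(5+6)/8)";   /* writable array: strsep() modifies its input */
char *string = buf;
char *token;

while ((token = strsep(&string, "(+/) ")) != NULL)
{
    if (*token == '\0')
        continue;   /* adjacent delimiters produce empty tokens */
    /* Store token... */
}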

Here - token will be processed similar to a Split() in Java/C#. This does mutilate the string while processing it - however, with the correct delimiters - things will be good :)

score 0 · Answer 5 · answered Dec 28 '10 at 16:36

Sounds like you want to look at the standard strtok function.

answered Dec 28 '10 at 16:36

Andy Lester

`strtok()` is not helpful here; it tramples NUL '\0' bytes onto the source string at the end of the token, but in the example, there are no spaces that can safely be trampled. – Jonathan Leffler Dec 28 '10 at 17:04

score 0 · Answer 6 · answered Dec 28 '10 at 16:54

this will give you an idea:

#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
    printf("\nargv[1]: %s",argv[1]);
    char *p;
    p = strtok(argv[1],"+");
    printf("\np: %s", p);
    p = strtok(NULL,"+");
    printf("\np: %s", p);
    p = strtok(NULL,"+");
    printf("\np: %s", p);
    printf("\n");
}

This is just sample code to demonstrate how it is done, using the addition case only.
Get the main idea of this code and apply it in your code.
Example output for this:

./a.out 5+3+9

argv[1]: 5+3+9
p: 5
p: 3
p: 9

Again, I am only demonstrating the "+" sign. You may want to check for p until it is NULL, then proceed with the next operation, say subtraction, then multiplication, then division.

answered Dec 28 '10 at 16:54

Neilvert Noval

`strtok()` is not the right tool here; you happen to know that every operator is `+`, but in the question, one of the operators is `-` and if your code is modified to allow for that, `strtok()` obliterates the operators before you get a chance to see what it is. – Jonathan Leffler Dec 28 '10 at 17:08
I'm kind of confused as to why the first argument in most of your strtok calls is NULL. I just finished reading about them and I guess I don't fully understand what exactly the function does. Could you explain further why you have to call `p = strtok(NULL,"+")` until the end of the string? – Devan Buggay Dec 28 '10 at 17:09
@jonathan: you can actually use it with `4+5-546`. The code will give you `4` and `5-546`, allowing you to perhaps parse the `5-546` before adding it to 4, for example. – Neilvert Noval Dec 28 '10 at 17:16
@pwnmonkey: pass NULL to iterate through your string. The first strtok call returns a pointer to the string before the "+". Passing NULL to strtok then moves your pointer to the next string after the "+". – Neilvert Noval Dec 28 '10 at 17:18
Suppose the program is to be a 4-function calculator. How do you decide what the operators are? They're `"+-*/"`; so you pass that as the second argument to `strtok()`, and it obliterates the operator. If you have to read ahead to find out what the operator is before you use `strtok()`, then you have defeated the point of using `strtok()`. – Jonathan Leffler Dec 28 '10 at 17:34
@Jonathan no, he proposes to tokenise the string multiple times - with "+", then with "-" etc. and take notes on what NULLs were inserted on each run. – Karl Knechtel Dec 28 '10 at 17:42
