
I have written this C code to read a string from stdin, tokenize it, and then print each token:

#include <stdio.h>
#include <stdlib.h>

void tokenize(char *buffer, char** tokens, int n) {
  int tokens_index = 0;
  int curr_index = 0;

  for(int i = 0; i < n; i++) {
    if(buffer[i] != ' ' && buffer[i] != '\0') {
      tokens[tokens_index][curr_index] = buffer[i];
      curr_index++;
    }
    else if(i != 0 && buffer[i] == ' ' && buffer[i-1] == ' ') {
      continue;
    }
    else if(buffer[i] == ' ') {
      curr_index = 0;
      tokens_index++;
    }
    else if(buffer[i] == '\0') {
      tokens_index++;
      tokens[tokens_index][0] == '\0';
      break;
    }
  }
}


int main(int argc, char **argv) {
  char buffer[50];

  while(fgets(buffer, 50, stdin)) {
    char **tokens = (char **) malloc(sizeof(char *) * 20);
    for(int i = 0; i < 20; i++) {
      tokens[i] = (char *) malloc(sizeof(char) * 50);
    }
    tokenize(buffer, tokens, 50);
    for(int i = 0; i < 20; i++) {
      if(*tokens[i] == '\0') {
        break;
      }
      printf("%s\n", tokens[i]);
    }

    for(int i = 0; i < 20; i++) {
      free(tokens[i]);
    }
    free(tokens);
  }

  return 0;
}

Input:

ab cd ef
gh ij kl

The first time the loop runs, it successfully tokenizes the user's input and prints it to stdout. However, on subsequent iterations the tokens are output, but garbage is printed as well.

Output:

ab
cd
ef

gh
ij
kl
��
��
`�

If I remove the free() calls inside the nested for loop, however, it works perfectly.

I don't understand why it is doing this. Shouldn't my program be able to just allocate new memory for each token after I free the char * in the previous iteration? What am I missing?

  • Without seeing what `tokenize()` does, it's hard to tell. Are you sure it adds an extra token that begins with `\0` at the end of the tokens? – Barmar Sep 22 '22 at 17:11
  • Unrelated to the problem: [don't cast malloc](https://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc) – Barmar Sep 22 '22 at 17:14
  • `tokenize()` doesn't add null terminators at the end of each token. Any reason you don't use `strtok()`? – Barmar Sep 22 '22 at 17:16
  • The reason it works the first time is just pure luck. `malloc()` is returning new memory that's initialized to `0` the first time because the program just started. The second time, it reuses previously allocated memory, and it's not zeroed. – Barmar Sep 22 '22 at 17:19
  • Your program always prints 20 tokens. – user253751 Sep 22 '22 at 17:20
  • Solved! Thanks Barmar. By calling memset(tokens[i], 0, 20) inside the loop where I allocate the memory, it stops printing junk (a sketch of this fix follows below). – boring_pencil Sep 22 '22 at 17:35
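
A minimal sketch of the zero-initialization fix the comments converge on, assuming the same 20-token / 50-byte layout as the question's main(); the alloc_tokens() helper name is my own, and error checking is omitted for brevity:

#include <stdlib.h>

/* Allocate 20 token buffers of 50 bytes each, zero-filled, so an
   untouched buffer already reads as an empty string when printed. */
char **alloc_tokens(void) {
  char **tokens = malloc(sizeof(char *) * 20);
  for (int i = 0; i < 20; i++) {
    tokens[i] = calloc(50, 1);   /* calloc() replaces malloc() + memset() */
  }
  return tokens;
}

In main(), char **tokens = alloc_tokens(); would replace the existing allocation loop; the free() calls can stay exactly as they are, since the problem was never the freeing but the uninitialized memory that later malloc() calls return.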

1 Answer


Always enable and heed your compiler's warnings. (I use -Wall -Wextra -pedantic with gcc.) Doing so finds one of the problems:

<source>: In function 'tokenize':
<source>:22:31: warning: statement with no effect [-Wunused-value]
   22 |       tokens[tokens_index][0] == '\0';
      |       ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~

There is a second problem, though: the tokens are never terminated with a NUL ('\0'). -fsanitize=address can help you find it.
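
For illustration only (this sketch is mine, not part of the original answer), here is one way tokenize() could NUL-terminate every token and fix the == typo the warning flags, keeping the question's 20-token / 50-byte assumptions and also stripping the newline that fgets() leaves in the buffer:

void tokenize(char *buffer, char **tokens, int n) {
  int tokens_index = 0;
  int curr_index = 0;

  for (int i = 0; i < n; i++) {
    char c = buffer[i];
    if (c != ' ' && c != '\n' && c != '\0') {
      tokens[tokens_index][curr_index++] = c;   /* copy a token character */
    }
    else if (curr_index > 0) {
      tokens[tokens_index][curr_index] = '\0';  /* = (not ==): terminate the finished token */
      tokens_index++;
      curr_index = 0;
    }
    if (c == '\n' || c == '\0') {
      tokens[tokens_index][0] = '\0';           /* empty token marks the end of the list */
      break;
    }
  }
}

Like the original, this does not guard against more than 20 tokens. With the tokens terminated, the per-iteration free() calls in main() can stay; the garbage came from printing buffers that were never NUL-terminated, not from freeing them.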

ikegami