
So I have the following function:

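#include <stdio.h>
#include <string.h>

void tokenize(void) {
    char *word;
    /* A writable array is required: strtok modifies the string in place. */
    char text[] = "Some - text, from stdin. We'll see! what happens? 4ND 1F W3 H4V3 NUM83R5?!?";
    int nbr_words = 0;

    word = strtok(text, " ,.-!?()");

    while (word != NULL) {
        printf("%s\n", word);
        word = strtok(NULL, " ,.-!?()");  /* NULL continues with the same string */
        nbr_words += 1;
    }

    printf("\n\n%d words\n", nbr_words);
}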

And the output is:

Some
text
from
stdin
We'll
see
what
happens
4ND
1F
W3
H4V3
NUM83R5


13 words

Basically, what I'm doing is tokenizing paragraphs of text into words for further analysis down the road. I have my text, and I have my delimiters. The only problem is treating digits as delimiters along with all the rest. I know that I can use isdigit from ctype.h. However, I don't know how to include it in the strtok call.

For example (obviously wrong): strtok(paragraph, " ,.-!?()isdigit()");

Something along those lines. But since I have each token (word) at this stage, is there some kind of post-processing if statement I could use to further tokenize each word, splitting at digits?

For example, the last five tokens of the output would break down further into:

ND
F
W
H
V
NUM
R

15 words // updated counter to include new tokens
Chris Cirefice

2 Answers


strtok is very simple in this respect: just list all the digits as delimiters, one by one - like this:

strtok(paragraph, " ,.-!?()0123456789");
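To make that concrete, here is a minimal sketch built on the loop from the question. The helper name count_tokens and its parameters are made up for illustration; only strtok and printf are real library calls.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every token in `text` and returns how many
 * were found. `text` must be writable, because strtok overwrites each
 * delimiter it stops at with '\0'. */
int count_tokens(char *text, const char *delims) {
    assert(text != NULL && delims != NULL);

    int nbr_words = 0;
    for (char *word = strtok(text, delims);
         word != NULL;
         word = strtok(NULL, delims)) {
        printf("%s\n", word);
        nbr_words += 1;
    }
    return nbr_words;
}
```

Calling count_tokens(text, " ,.-!?()0123456789") on the string from the question should split 4ND, 1F, W3, H4V3 and NUM83R5 into ND, F, W, H, V, NUM and R, which brings the count from 13 to 15.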

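Note: strtok is an old, non-reentrant function. It keeps its position in hidden static state, so two tokenizations cannot be interleaved and it is not safe to call from several threads at once. For new code, prefer strtok_r where it is available (it is specified by POSIX, not by the C standard). It has a similar interface, but stores the state in a pointer that the caller supplies, so it can be used in concurrent environments and other situations where you need reentrancy.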
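As a rough illustration of what reentrancy buys you, the sketch below runs two tokenizations at once: an outer one over sentences and an inner one over the words of each sentence. That nesting cannot be done with plain strtok, because the inner call would overwrite the outer call's saved position. The function name count_words_nested and the two delimiter sets are assumptions made for this example, and it assumes a POSIX system where strtok_r is declared in string.h.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for POSIX declarations such as strtok_r */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits `text` into sentences, then each sentence
 * into words, and returns the total number of words. `text` is modified. */
int count_words_nested(char *text) {
    assert(text != NULL);

    const char *sentence_delims = ".!?";
    const char *word_delims = " ,-()0123456789";
    char *outer_state = NULL;  /* position of the sentence tokenizer */
    int total = 0;

    for (char *sentence = strtok_r(text, sentence_delims, &outer_state);
         sentence != NULL;
         sentence = strtok_r(NULL, sentence_delims, &outer_state)) {
        char *inner_state = NULL;  /* independent position for the word tokenizer */
        for (char *word = strtok_r(sentence, word_delims, &inner_state);
             word != NULL;
             word = strtok_r(NULL, word_delims, &inner_state)) {
            printf("%s\n", word);
            total += 1;
        }
    }
    return total;
}
```

Each tokenizer keeps its place in its own char * variable, which is the only real difference from the strtok version.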

Sergey Kalinichenko
  • Can you define `reentrancy`? I've read (today) that `strtok` is not thread-safe because it doesn't save the state or something? I've never done multi-threading, not even in Java. However, this program is just a really simple in-out text processor, not for anything serious except my own personal use. – Chris Cirefice Oct 20 '13 at 16:55
  • @ChrisCirefice `strtok` is not thread-safe (and not reentrant) because it is using a global variable to keep its state. That is how it knows to continue returning parts of the string being tokenized when you keep passing `NULL` to it. If you do not plan for any concurrency, you can certainly use `strtok`, but it is a good idea to learn about `strtok_r` as well. Here is a good answer that discusses reentrancy and thread safety: [link](http://stackoverflow.com/a/856860/335858). – Sergey Kalinichenko Oct 20 '13 at 17:00
  • That's a great read! Considering the simplicity of this program, I think that `strtok` will work just fine. Thank you for the reference though! – Chris Cirefice Oct 20 '13 at 17:06

Why not just use:

    word = strtok(text, " ,.-!?()1234567890");
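If you would rather call isdigit than spell out the digits, strtok cannot take a predicate, but a small hand-written scanner can. This is only a sketch of that alternative, not something strtok offers: is_delim and count_tokens_pred are hypothetical names, and the punctuation set is copied from the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical predicate: a character separates tokens if it is a digit
 * or one of the punctuation delimiters from the question. Callers must
 * not pass '\0', since strchr also matches the terminator. */
static int is_delim(char c) {
    return isdigit((unsigned char)c) || strchr(" ,.-!?()", c) != NULL;
}

/* Hypothetical helper: prints each token and returns the token count.
 * Unlike strtok, it leaves `text` unmodified. */
int count_tokens_pred(const char *text) {
    assert(text != NULL);

    int nbr_words = 0;
    const char *p = text;
    while (*p != '\0') {
        while (*p != '\0' && is_delim(*p)) {
            p++;  /* skip a run of delimiters */
        }
        const char *start = p;
        while (*p != '\0' && !is_delim(*p)) {
            p++;  /* advance to the end of the token */
        }
        if (p > start) {
            printf("%.*s\n", (int)(p - start), start);
            nbr_words += 1;
        }
    }
    return nbr_words;
}
```

Because nothing is written into the input, this version also works on string literals and const buffers.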
ciphermagi
  • I would mark this as the correct answer because you did answer first. However, @dasblinkenlight provided explanation as well as a more desirable function to use :) – Chris Cirefice Oct 20 '13 at 16:56