So I have the following function:
#include <stdio.h>
#include <string.h>

void tokenize(void) {
    char *word;
    char text[] = "Some - text, from stdin. We'll see! what happens? 4ND 1F W3 H4V3 NUM83R5?!?";
    int nbr_words = 0;

    word = strtok(text, " ,.-!?()");
    while (word != NULL) {
        printf("%s\n", word);
        word = strtok(NULL, " ,.-!?()");
        nbr_words += 1;
    }
    printf("%d words\n", nbr_words); /* report the final count */
}
And the output is:
Some
text
from
stdin
We'll
see
what
happens
4ND
1F
W3
H4V3
NUM83R5
13 words
Basically, what I'm doing is tokenizing paragraphs of text into words for further analysis down the road. I have my text, and I have my delimiters. The only problem is treating digits as delimiters along with all the rest. I know that I can use isdigit in ctype.h. However, I don't know how to work it into the strtok call.
For example (obviously wrong): strtok(paragraph, " ,.-!?()isdigit()");
Something along those lines. But since I have each token (word) at this stage, is there some kind of post-processing if statement I could use to further tokenize each word, splitting at digits?
For example, the tokens that contain digits would be broken down further into:
ND
F
W
H
V
NUM
R
15 words // updated counter to include new tokens
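A minimal sketch of the post-processing idea described above, assuming the goal is to treat every run of digits inside a word as a separator. The helper names split_at_digits and tokenize_text are hypothetical and are not part of the original function; the delimiter string is the one from the question.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints each run of non-digit characters in `word`
 * on its own line and returns how many such runs were found. */
int split_at_digits(const char *word) {
    int count = 0;
    const char *p = word;

    while (*p != '\0') {
        /* Skip any digits in front of the next sub-token. */
        while (*p != '\0' && isdigit((unsigned char)*p)) {
            p++;
        }
        const char *start = p;
        /* Advance to the next digit or to the end of the word. */
        while (*p != '\0' && !isdigit((unsigned char)*p)) {
            p++;
        }
        if (p > start) {
            printf("%.*s\n", (int)(p - start), start);
            count++;
        }
    }
    return count;
}

/* Hypothetical helper: tokenizes `text` in place with strtok, splits each
 * word at digits, and returns the total number of sub-tokens. */
int tokenize_text(char *text) {
    const char *delims = " ,.-!?()";
    int nbr_words = 0;

    for (char *word = strtok(text, delims); word != NULL;
         word = strtok(NULL, delims)) {
        nbr_words += split_at_digits(word);
    }
    return nbr_words;
}
```

Calling tokenize_text on the sample paragraph should count 15 sub-tokens, since words without digits pass through unchanged. If no per-word processing is needed, the same split can probably be had more simply: strtok treats every character of its second argument as a separate delimiter, so the digits can be listed literally, as in strtok(text, " ,.-!?()0123456789").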