4

Consider the following snippet, which uses strtok to split the string "Madddy".

char* str = (char*) malloc(sizeof("Madddy"));
strcpy(str,"Madddy");

char* tmp = strtok(str,"d");
std::cout<<tmp;

do
{
    std::cout<<tmp;
    tmp=strtok(NULL, "dddy");
}while(tmp!=NULL);

It works fine; the output is Ma. But after changing the strtok call inside the loop to the following,

tmp=strtok(NULL, "ay");

The output becomes Madd. So how exactly does strtok work? I ask because I expected strtok to treat each and every character in the delimiter string as a delimiter. In some cases it does that, but in a few cases it gives unexpected results. Could anyone help me understand this?

alk
Karthick
  • 6
    I honestly think the correct way to do this is to completely stop using `strtok`. It's a difficult-to-use, hard-to-debug function with no thread-safety guarantees at all. You're probably best off using some combination of `string::find` and `string::substr` to do the parsing. – templatetypedef Jan 14 '11 at 02:31
  • I am willing to repeat this for importance and emphasis, especially since you are using C++ and not C. Also, you might want to look into boost::tokenize. – Jim Brissom Jan 14 '11 at 02:42
  • 1
    You're not printing a newline or other symbol to separate the matching tokens. Madddy, with delimiter characters d and y (no need to specify d three times), only contains the "Ma" token and trailing delimiters. Madddy with delimiters a and y consists of tokens "M" and "ddd" - print them without spaces and you see "Mddd". You say you saw "Madd"? I assume that's a typo...? – Tony Delroy Jan 14 '11 at 02:51
  • @Tony: I think Karthick is right. The problem here is that Karthick doesn't use a separator between his tokens, so it's very difficult to give an exact answer. – Hoàng Long Jan 14 '11 at 03:19
  • Changed tag to C. People who look at the C tag may be better able to help than pure C++ developers who generally prefer other forms of tokenization. – Martin York Jan 14 '11 at 06:52
  • For an opposite view, if you know what you are doing, strtok is just fine. I've always been of the opinion that pulling in boost to do a simple job that strtok can do is overkill. Certainly cases can be made that something more robust is needed, but sometimes using boost is like using a sledgehammer to drive a nail. – Mark Jan 14 '11 at 19:40
  • The thing that worries me is that in spite of so much discussion on my question, I have got just one vote for it!! – Karthick Jan 17 '11 at 00:29
  • @template yes. strtok, fscanf, atoi... they all should be simply banned from use. – EvilTeach Dec 03 '11 at 22:40

6 Answers

9

"Trying to understand strtok" Good luck!

Anyway, we're in 2011. Tokenise properly:

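#include <sstream>
#include <string>
#include <vector>

std::string str("abc:def");
char split_char = ':';
std::istringstream split(str);
std::vector<std::string> token;

// getline reads up to each split_char; every piece is appended to token
for (std::string each; std::getline(split, each, split_char); token.push_back(each));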

:D
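One difference from strtok is worth seeing in code: std::getline takes a single delimiter character and keeps empty fields, whereas strtok accepts a set of delimiters and skips runs of them. A minimal sketch, wrapping the loop above in a function (the names split_getline and check_split_getline are made up for illustration):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits `text` on the single character `sep` using std::getline.
// Consecutive separators produce empty strings in the result.
std::vector<std::string> split_getline(const std::string& text, char sep)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string each; std::getline(in, each, sep); tokens.push_back(each)) {
    }
    return tokens;
}

// Small self-check: "a::b" has an empty field between the two colons.
void check_split_getline()
{
    const std::vector<std::string> t = split_getline("a::b", ':');
    assert(t.size() == 3);
    assert(t[1].empty());
}
```

If empty fields are not wanted, they can be filtered out after the split.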

Lightness Races in Orbit
3

Fred Flintstone probably used strtok(). It predates multithreaded environments and beats up (modifies) the source string.

When called with NULL for the first parameter, it continues parsing the last string. This feature was convenient, but a bit unusual even in its day.
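Both points can be made visible with a small sketch. It tokenises a private copy of the input, passing the buffer on the first call and NULL afterwards, and then reports what the buffer looks like once strtok is done. The helper name split_with_strtok and the choice of '|' to display the written '\0' bytes are assumptions for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Tokenises a private copy of `text` with strtok and returns the tokens.
// If `raw_after` is non-null it receives the buffer contents afterwards,
// with every '\0' that strtok wrote shown as '|'.
std::vector<std::string> split_with_strtok(const std::string& text,
                                           const char* delims,
                                           std::string* raw_after = nullptr)
{
    assert(delims != nullptr);
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');                  // strtok needs a writable C string

    std::vector<std::string> tokens;
    // First call passes the buffer; later calls pass NULL to continue in it.
    for (char* tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.push_back(tok);
    }

    if (raw_after != nullptr) {
        raw_after->clear();
        for (std::size_t i = 0; i < text.size(); ++i) {
            raw_after->push_back(buf[i] == '\0' ? '|' : buf[i]);
        }
    }
    return tokens;
}
```

For "Madddy" with the delimiters "dy" this yields the single token "Ma", and the buffer afterwards reads "Ma|ddy": the first 'd' has been overwritten, the rest is untouched.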

wallyk
2

Actually your code is wrong, no wonder you get unexpected results:

char* str = (char*) malloc(sizeof("Madddy"));

should be

char* str = (char*) malloc(strlen("Madddy") + 1);
AndersK
  • 1
    Yep, the first example probably allocates 4 bytes (in 32-bit environments) which is the size of a pointer. The type of a string constant like `"abcdefghijkm"` is a pointer (specifically `char *` or `const char *` depending on the compiler). – wallyk Jan 14 '11 at 02:49
  • 1
    @wallyk: Actually, [it doesn't](http://codepad.org/H7zJkjCN). The type of string literals is a char array. @Anders: Though the code is weird, yours works *identically.* – Fred Nurk Jan 14 '11 at 05:31
  • @Fred Nurk: Oh yes. That seems to break the pattern, probably to be more useful. It's long been a construct I avoid. (Apologies for the wrong information two comments above.) – wallyk Jan 14 '11 at 06:16
  • It is best to avoid the use of `sizeof` though, in cases like this where it's easy to get confused as to what you're actually doing. Prefer to write a template function for array length, then you definitely can't go wrong: `template <typename T, size_t N> size_t array_size(const T (&)[N]) { return N; }`. – Lightness Races in Orbit Jan 14 '11 at 10:19
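To make the outcome of this comment thread concrete, here is a small sketch that checks it. A string literal is a char array, so sizeof counts the characters plus the terminator, which is the same number as strlen + 1. The function name check_allocation_sizes is made up for illustration, and array_size is one possible spelling of the template suggested in the comments:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Element count of a real array; passing a pointer fails to compile.
template <typename T, std::size_t N>
constexpr std::size_t array_size(const T (&)[N]) { return N; }

// Compile time: the literal "Madddy" is an array of 6 characters + '\0'.
static_assert(sizeof("Madddy") == 7, "sizeof counts the terminator");
static_assert(array_size("Madddy") == 7, "same count via the template");

// Run time: both spellings request the same number of bytes from malloc.
void check_allocation_sizes()
{
    const char* as_pointer = "Madddy";
    assert(sizeof("Madddy") == std::strlen(as_pointer) + 1);
    // sizeof(as_pointer) would give the size of the pointer instead,
    // which depends on the platform.
}
```

So the allocation in the question is large enough; the confusion only arises when sizeof is applied to a pointer rather than to the literal or an array.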
2

It seems you forgot that you have already called strtok the first time (outside the loop) with the delimiter "d".

strtok is working fine. You can find a reference here.

For the second example (strtok(NULL, "ay")):

First, you call strtok(str, "d"). It looks for the first "d" and splits the string there by overwriting that "d" with '\0'. Specifically, it returns tmp = "Ma" and remembers that the rest of the string is "ddy" (the first "d" is dropped).

Then you call strtok(NULL, "ay"). It continues in the remaining "ddy" and looks for an "a" or a "y". There is no "a", but the "y" ends the token, so the "y" is overwritten and tmp = "dd". The next call finds nothing left and returns NULL.

It prints "Madd" as you saw.
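The same sequence of calls can be replayed with visible token boundaries. This is only a sketch: the function name trace_madddy and the bracket formatting are made up for illustration, and each token is recorded exactly once.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Replays the calls from the question on a copy of "Madddy" and returns the
// tokens in order, each wrapped in brackets so the boundaries are visible.
// `later_delims` is the delimiter set used for every call after the first.
std::string trace_madddy(const char* later_delims)
{
    assert(later_delims != nullptr);
    char buf[] = "Madddy";                  // writable copy; strtok edits it

    std::string out;
    char* tmp = std::strtok(buf, "d");      // first call splits on 'd' only
    while (tmp != nullptr) {
        out += "[";
        out += tmp;
        out += "]";
        tmp = std::strtok(nullptr, later_delims);  // continue after last token
    }
    return out;
}
```

trace_madddy("dddy") returns "[Ma]", because the remaining "ddy" consists only of delimiters. trace_madddy("ay") returns "[Ma][dd]", which reads as "Madd" when printed without separators.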

Hoàng Long
  • @Karthick: The first example works, but it may not work the way you think. I recommend printing cout << tmp << " - " instead of cout << tmp, so that you can see where each token begins and ends. – Hoàng Long Jan 14 '11 at 03:17
  • please tell me what about a same piece of code surrounded by another loop, would there be a way to reinitialize strtok ? – Yvain Apr 20 '19 at 01:27
0

I asked a question, inspired by another question, about C standard library functions that cause security problems or are considered bad practice.

To quote the answer given to me from there:

A common pitfall with the strtok() function is to assume that the parsed string is left unchanged, while it actually replaces the separator character with '\0'.

Also, strtok() is used by making subsequent calls to it, until the entire string is tokenized. Some library implementations store strtok()'s internal status in a global variable, which may induce some nasty surprises if strtok() is called from multiple threads at the same time.

As you've tagged your question C++, use something else! If you want to use C, I'd suggest implementing your own tokenizer that works in a safe fashion.
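As a rough idea of what such a tokenizer could look like, here is a sketch built on the standard strspn and strcspn functions. It aims to follow strtok's splitting rules (any character in the delimiter set separates, runs of delimiters are skipped, no empty tokens) but it copies tokens out instead of writing to the input, and it keeps no state between calls. The name split_copy is hypothetical, and this is one possible design rather than a drop-in replacement:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Splits `text` on any character in `delims` without modifying `text`
// and without any hidden state, so it is safe to call from several threads
// as long as nobody else writes to `text` at the same time.
std::vector<std::string> split_copy(const char* text, const char* delims)
{
    assert(text != nullptr && delims != nullptr);
    std::vector<std::string> tokens;
    const char* p = text;
    while (true) {
        p += std::strspn(p, delims);                // skip leading delimiters
        if (*p == '\0') {
            break;                                  // nothing left to split
        }
        const std::size_t len = std::strcspn(p, delims);  // up to next delimiter
        tokens.emplace_back(p, len);                // copy the token out
        p += len;
    }
    return tokens;
}
```

The cost is one string copy per token, which is usually acceptable when the input must stay intact.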

Community
0

Since you changed your tag to be C and not C++, I rewrote your function to use printf so that you can see what is happening. Hoang is correct. You are seeing correct output, but I think that you are printing everything on the same line, so you got confused by the output. Look at Hoang's answer, as he explains what is happening correctly. Also, as others have noted, strtok destroys the input string, so you have to be careful about that, and it's not thread safe. But if you need a quick and dirty tokenizer, it works. Also, I changed the code to use strlen rather than sizeof, as pointed out by Anders.

Here is your code modified to be more C-like:

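#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* strtok modifies its input, so work on a writable copy */
    char* str = (char*) malloc(strlen("Madddy") + 1);
    strcpy(str,"Madddy");

    /* first call: pass the string itself, split on 'd' */
    char* tmp = strtok(str,"d");
    printf ("first token: %s\n", tmp);

    /* later calls: pass NULL to continue, split on 'a' or 'y' */
    do
    {
        tmp=strtok(NULL, "ay");
        if (tmp != NULL ) {
           printf ("next token: %s\n", tmp);
        }
    } while(tmp != NULL);

    free(str);
    return 0;
}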
Mark