59

I have a string that I would like to tokenize. But the C strtok() function requires my string to be a char*. How can I do this simply?

I tried:

token = strtok(str.c_str(), " "); 

which fails because c_str() returns a const char*, not a char*.

Flexo

13 Answers

80
#include <iostream>
#include <string>
#include <sstream>
int main(){
    std::string myText("some-text-to-tokenize");
    std::istringstream iss(myText);
    std::string token;
    while (std::getline(iss, token, '-'))
    {
        std::cout << token << std::endl;
    }
    return 0;
}

Or, as mentioned, use boost for more flexibility.

Chris Blackwell
  • strtok() supports multiple delimiters while getline does not. Is there a simple way to circumvent that? – thegreatcoder Jan 24 '19 at 23:25
  • @thegreatcoder I believe you could use regex_token_iterator to tokenize with multiple delimiters. And thanks for the blast from the past, I answered the original question a loooooong time ago :) – Chris Blackwell Jan 25 '19 at 20:32
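The regex_token_iterator suggestion from the comment above could look roughly like the sketch below. It assumes C++11 `<regex>` is available; `split_any` is a hypothetical helper name chosen for illustration, and the delimiter set is passed as a regular expression rather than as a plain character list.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` wherever `delims` matches.
std::vector<std::string> split_any(const std::string& text, const std::regex& delims)
{
    std::vector<std::string> tokens;
    // Submatch index -1 selects the text *between* delimiter matches.
    std::sregex_token_iterator it(text.begin(), text.end(), delims, -1);
    std::sregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() != 0) {          // drop empty pieces, as strtok() would
            tokens.push_back(it->str());
        }
    }
    return tokens;
}
```

Called as `split_any("20#6 5, 3", std::regex("[#, ]+"))`, this should yield the four tokens 20, 6, 5 and 3 while leaving the input string untouched.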
22

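Duplicate the string, tokenize it, then free it once you are done with the tokens.

char *dup = strdup(str.c_str());   // writable copy; strdup() is POSIX, declared in <string.h>
token = strtok(dup, " ");
// ... use token here: it points into dup, so it dangles after free() ...
free(dup);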
DocMax
  • Isn't the better question, why use strtok when the language in question has better native options? – Kendall Helmstetter Gelner Nov 14 '08 at 06:24
  • Not necessarily. If the context of the question surrounds maintaining a fragile codebase, then stepping away from the existing approach (notionally strtok in my example) is riskier than changing the approach. Without more context in the question, I prefer to answer what is asked. – DocMax Nov 14 '08 at 06:27
  • If the asker is a newbie, you should warn against doing free() before using token... :-) – PhiLho Nov 14 '08 at 06:34
  • I am dubious that using a more robust native tokenizer is ever less safe than inserting new code that calls a library that inserts nulls into the block of memory passed to it... that's why I did not think it a good idea to answer the question as asked. – Kendall Helmstetter Gelner Nov 14 '08 at 06:48
  • Note that `strtok()` is not thread-safe or re-entrant. In a program with multiple tasks, it should be avoided. – Colin D Bennett Aug 10 '15 at 15:28
  • Also, while we are at it, we should note that `strdup()` comes from POSIX which is why it may be preferable not to use it. – FanaticD Apr 26 '17 at 12:10
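To make the ordering concern from the comments concrete, here is a minimal sketch that copies every token into a std::string before the duplicate is freed. `tokenize_copy` is a hypothetical name, and the sketch assumes a platform where POSIX `strdup()` is declared in `<string.h>`.

```cpp
#include <cstdlib>    // std::free
#include <cstring>    // std::strtok
#include <string.h>   // strdup() (POSIX, not ISO C++)
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a copy of `str`, so `str` itself is never modified.
std::vector<std::string> tokenize_copy(const std::string& str, const char* delims)
{
    std::vector<std::string> tokens;
    char* dup = strdup(str.c_str());       // writable copy owned by this function
    if (dup == nullptr) {
        return tokens;                     // allocation failed
    }
    for (char* tok = std::strtok(dup, delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);          // copy the token out while dup is still alive
    }
    std::free(dup);                        // safe now: nothing points into dup any more
    return tokens;
}
```

The sketch is not exception-safe (a throwing `emplace_back` would leak `dup`) and it inherits strtok()'s lack of re-entrancy; both are accepted here to keep it short.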
20
  1. If boost is available on your system (I think it's standard on most Linux distros these days), it has a Tokenizer class you can use.

  2. If not, then a quick Google turns up a hand-rolled tokenizer for std::string that you can probably just copy and paste. It's very short.

  3. And, if you don't like either of those, then here's a split() function I wrote to make my life easier. It'll break a string into pieces using any of the chars in "delim" as separators. Pieces are appended to the "parts" vector:

    void split(const string& str, const string& delim, vector<string>& parts) {
      size_t start, end = 0;
      while (end < str.size()) {
        start = end;
        while (start < str.size() && (delim.find(str[start]) != string::npos)) {
          start++;  // skip leading delimiters
        }
        end = start;
        while (end < str.size() && (delim.find(str[end]) == string::npos)) {
          end++; // skip to end of word
        }
        if (end-start != 0) {  // just ignore zero-length strings.
          parts.push_back(string(str, start, end-start));
        }
      }
    }
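For comparison, the same idea can be sketched with the search members std::string already provides. This is an alternative formulation, not the answer's code: `split_chars` is a hypothetical name, and it returns the pieces instead of appending to an output parameter.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits `str` on any character contained in `delim`,
// skipping empty pieces.
std::vector<std::string> split_chars(const std::string& str, const std::string& delim)
{
    std::vector<std::string> parts;
    // Start of the first token: the first character that is not a delimiter.
    std::string::size_type start = str.find_first_not_of(delim);
    while (start != std::string::npos) {
        // End of the token: the next delimiter, or npos for the last token.
        std::string::size_type end = str.find_first_of(delim, start);
        parts.push_back(str.substr(start, end - start));   // substr clamps at the end of str
        start = str.find_first_not_of(delim, end);
    }
    return parts;
}
```

For example, `split_chars("  one, two,,three ", ", ")` should produce one, two and three.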
    
Todd Gamblin
9

There is a more elegant solution.

With std::string you can use resize() to allocate a suitably large buffer, and &s[0] to get a pointer to the internal buffer.

At this point many fine folks will jump and yell at the screen. But this is the fact. About 2 years ago the library working group decided (meeting at Lillehammer) that just like for std::vector, std::string should also formally, not just in practice, have a guaranteed contiguous buffer.

The other concern is whether strtok() increases the size of the string. The MSDN documentation says:

Each call to strtok modifies strToken by inserting a null character after the token returned by that call.

But this is not correct. Actually the function replaces the first separator character after each token with \0, so the size of the string does not change. If we have this string:

one-two---three--four

we will end up with

one\0two\0--three\0-four

So my solution is very simple:


std::string str("some-text-to-split");
char seps[] = "-";
char *token;

token = strtok( &str[0], seps );
while( token != NULL )
{
   /* Do your thing */
   token = strtok( NULL, seps );
}

Read the discussion on http://www.archivum.info/comp.lang.c++/2008-05/02889/does_std::string_have_something_like_CString::GetBuffer

Martin Dimitrov
  • -1. `strtok()` works on a null-terminated string while `std::string`'s buffer is not required to be null-terminated. There is no way around `c_str()`. – SnakE Oct 29 '15 at 00:22
  • @SnakE `std::string`'s buffer *is* required to be null-terminated. `data` and `c_str` are required to be identical and [`data() + i == &operator[](i)` for every `i` in `[0, size()]`](http://en.cppreference.com/w/cpp/string/basic_string/c_str). – Alex Celeste Feb 13 '17 at 14:15
  • @Leushenko you're partially right. Null-termination is only guaranteed since C++11. I've added a note to the answer. I'll lift my -1 as soon as my edit is accepted. – SnakE Feb 14 '17 at 18:52
  • This hack is not worth it. This "elegant" solution wrecks std::string object in a few ways. `std::cout << str << " " << str.size(); std::cout << str.c_str()<< " " << strlen(str.c_str());` Before: `some-text-to-split 18 some-text-to-split 18` After: `sometexttosplit 18 some 4`. – dmitri Aug 04 '17 at 02:23
  • What is the purpose of "token = strtok( NULL, seps )" in the code above? Please answer, because I tried to search for this usage but could not find much. – Chandra Shekhar May 07 '18 at 14:10
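On the last question: strtok() remembers where the previous call stopped, so passing NULL means "continue in the same buffer". The sketch below wraps the loop from the answer in a hypothetical helper, `tokenize_in_place`, assuming C++11 or later so that `&str[0]` points at a contiguous, null-terminated buffer.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `str` in place and returns copies of the tokens.
// `str` keeps its size but ends up with '\0' bytes where separators used to be.
std::vector<std::string> tokenize_in_place(std::string& str, const char* seps)
{
    std::vector<std::string> tokens;
    char* token = std::strtok(&str[0], seps);   // first call: hand over the buffer
    while (token != nullptr) {
        tokens.emplace_back(token);
        token = std::strtok(nullptr, seps);     // later calls: NULL = resume after the last token
    }
    return tokens;
}
```

Applied to "one-two---three--four" with "-" as separator, the tokens should be one, two, three and four, while the string still reports a size of 21 and `c_str()` now reads as just "one". That is the damage dmitri's comment describes.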
3

With C++17, std::string gains a non-const data() overload that returns a pointer to a modifiable buffer, so the string can be passed to strtok directly without any hacks:

#include <string>
#include <iostream>
#include <cstring>
#include <cstdlib>

int main()
{
    ::std::string text{"pop dop rop"};
    char const * const psz_delimiter{" "};
    char * psz_token{::std::strtok(text.data(), psz_delimiter)};
    while(nullptr != psz_token)
    {
        ::std::cout << psz_token << ::std::endl;
        psz_token = std::strtok(nullptr, psz_delimiter);
    }
    return EXIT_SUCCESS;
}

output

pop
dop
rop

user7860670
  • note: the original `std::string` will not hold the same value anymore, as strtok replaces the delimiter it found with a null terminator in place, instead of returning you a copy of the string. if you want to keep the original string, create a copy of the string and pass that into strtok. – user233009 Apr 04 '20 at 00:50
  • @user233009 note: if `strtok` handles only a single delimiter then the original value of the string may be preserved by putting back delimiter replacing null terminator on each iteration. – user7860670 Apr 04 '20 at 13:34
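The "put the delimiter back" idea from the last comment might be sketched as follows. `tokenize_and_restore` is a hypothetical name; the sketch assumes a single delimiter character other than '\0' and a string without embedded null bytes. It uses `&text[0]` so that it also compiles before C++17; `text.data()` is equivalent from C++17 on.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes `text` on one delimiter character and
// restores every byte strtok() overwrote, so `text` is unchanged on return.
std::vector<std::string> tokenize_and_restore(std::string& text, char delimiter)
{
    std::vector<std::string> tokens;
    const char delims[2] = {delimiter, '\0'};
    char* const begin = &text[0];
    char* const end = begin + text.size();        // the string's own terminator
    for (char* tok = std::strtok(begin, delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        tokens.emplace_back(tok);
        char* tok_end = tok + std::strlen(tok);   // where strtok wrote '\0' (or the real end)
        if (tok_end != end) {
            *tok_end = delimiter;                 // put the delimiter back
        }
    }
    return tokens;
}
```

Restoring the byte before the next call is fine because strtok() resumes from the character after the one it overwrote.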
2

EDIT: the const_cast is used only to demonstrate the effect of strtok() when applied to a pointer returned by string::c_str().

You should not use strtok() since it modifies the tokenized string, which may lead to undesired, if not undefined, behaviour, as the C string "belongs" to the string instance.

#include <cstring>   // strtok
#include <string>
#include <iostream>

int main(int ac, char **av)
{
    std::string theString("hello world");
    std::cout << theString << " - " << theString.size() << std::endl;

    //--- this cast *only* to illustrate the effect of strtok() on std::string 
    char *token = strtok(const_cast<char  *>(theString.c_str()), " ");

    std::cout << theString << " - " << theString.size() << std::endl;

    return 0;
}

After the call to strtok(), the space was "removed" from the string, or rather replaced by a non-printable character (the null byte), but the length remains unchanged.

>./a.out
hello world - 11
helloworld - 11

Therefore you have to resort to a native mechanism, a duplicate of the string, or a third-party library, as previously mentioned.

philant
  • casting away the const does not help. It is const for a reason. – Martin York Nov 14 '08 at 10:01
  • @Martin York, @Sherm Pendley : did you read the conclusion or only the code snippet ? I edited my answer to clarify what I wanted to show here. Rgds. – philant Nov 14 '08 at 16:14
  • @Philippe - Yes, I only read the code. A lot of people will do that, and go straight to the code and skip the explanation. Perhaps putting the explanation in the code, as a comment, would be a good idea? Anyhow, I removed my down vote. – Sherm Pendley Nov 14 '08 at 16:37
  • Does anybody know a compiler (Warning-switch) or a static code analyzer that warns about issues like this? – orbitcowboy Feb 21 '17 at 20:35
1

I suppose the language is C, or C++...

strtok, IIRC, replaces separators with \0. That's why it cannot work on a const string. To work around that "quickly", if the string isn't huge, you can just strdup() it. That is wise if you need to keep the string unaltered (which is what the const suggests...).

On the other hand, you might want to use another tokenizer, perhaps hand rolled, less violent on the given argument.

PhiLho
1

Assuming that by "string" you're talking about std::string in C++, you might have a look at the Tokenizer package in Boost.

Sherm Pendley
0

First off I would say use boost tokenizer.
Alternatively if your data is space separated then the string stream library is very useful.

But both the above have already been covered.
So as a third C-Like alternative I propose copying the std::string into a buffer for modification.

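// needs <cstring>, <string> and <vector>
std::string   data("The data I want to tokenize");

// Create a buffer of the correct length (+1 for the terminating '\0'):
std::vector<char>  buffer(data.size()+1);

// copy the string (including its terminator) into the buffer
strcpy(&buffer[0],data.c_str());

// Tokenize: the first call takes the buffer, later calls take NULL
for (char* token = strtok(&buffer[0]," "); token != NULL; token = strtok(NULL," "))
{
    // use token here; data itself is left untouched
}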
Martin York
0

If you don't mind open source, you could use the subbuffer and subparser classes from https://github.com/EdgeCast/json_parser. The original string is left intact, there is no allocation and no copying of data. I have not compiled the following so there may be errors.

std::string input_string("hello world");
subbuffer input(input_string);
subparser flds(input, ' ', subparser::SKIP_EMPTY);
while (!flds.empty())
{
    subbuffer fld = flds.next();
    // do something with fld
}

// or if you know it is only two fields
subbuffer fld1 = input.before(' ');
subbuffer fld2 = input.sub(fld1.length() + 1).ltrim(' ');
0

Typecasting to (char*) got it working for me!

token = strtok((char *)str.c_str(), " "); 
  • This will not work. strtok will modify the internals of str. I suppose that is a side effect the user doesn't want. The solution is to create a char buffer and first copy the str string into the buffer. – Vivian De Smedt Nov 25 '20 at 15:58
0

Chris's answer is probably fine when using std::string; however in case you want to use std::basic_string<char16_t>, std::getline can't be used. Here is a possible other implementation:

template <class CharT> bool tokenizestring(const std::basic_string<CharT> &input, CharT separator, typename std::basic_string<CharT>::size_type &pos, std::basic_string<CharT> &token) {
    if (pos >= input.length()) {
        // if input is empty, or ends with a separator, return an empty token when the end has been reached (and return an out-of-bound position so subsequent call won't do it again)
        if ((pos == 0) || ((pos > 0) && (pos == input.length()) && (input[pos-1] == separator))) {
            token.clear();
            pos=input.length()+1;
            return true;
        }
        return false;
    }
    typename std::basic_string<CharT>::size_type separatorPos=input.find(separator, pos);
    if (separatorPos == std::basic_string<CharT>::npos) {
        token=input.substr(pos, input.length()-pos);
        pos=input.length();
    } else {
        token=input.substr(pos, separatorPos-pos);
        pos=separatorPos+1;
    }
    return true;
}

Then use it like this:

std::basic_string<char16_t> s;
std::basic_string<char16_t> token;
std::basic_string<char16_t>::size_type tokenPos=0;
while (tokenizestring(s, (char16_t)' ', tokenPos, token)) {
    ...
}
Jérôme
-1

It fails because str.c_str() returns a pointer to a constant string, but char * strtok (char * str, const char * delimiters ) requires a modifiable (non-const) string. So you need to use const_cast<char *> in order to make it modifiable. Here is a complete but small program that tokenizes the string using the C strtok() function.

   #include <iostream>
   #include <string>
   #include <string.h> 
   using namespace std;
   int main() {
       string s="20#6 5, 3";
       // strtok requires a modifiable (non-const) string, as it modifies the supplied string in order to tokenize it
       char *str=const_cast< char *>(s.c_str());    
       char *tok;
       tok=strtok(str, "#, " );     
       int arr[4], i=0;    
       while(tok!=NULL){
           arr[i++]=stoi(tok);
           tok=strtok(NULL, "#, " );
       }     
       for(int i=0; i<4; i++) cout<<arr[i]<<endl;


       return 0;
   }

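NOTE: strtok may not be suitable in all situations, because the string passed to the function is modified: it gets broken into smaller strings. Also note that writing through the pointer returned by c_str() is formally undefined behaviour; since C++11, &s[0] yields a writable pointer to the same buffer. Please refer to the following for a better understanding of how strtok works.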

How strtok works

A few print statements are added to show how the string changes on each call to strtok and how each token is returned.

#include <iostream>
#include <string>
#include <string.h> 
using namespace std;
int main() {
    string s="20#6 5, 3";
    char *str=const_cast< char *>(s.c_str());    
    char *tok;
    cout<<"string: "<<s<<endl;
    tok=strtok(str, "#, " );     
    cout<<"String: "<<s<<"\tToken: "<<tok<<endl;   
    while(tok!=NULL){
        tok=strtok(NULL, "#, " );
        // streaming a null char* is undefined behaviour, so only print real tokens
        if(tok!=NULL) cout<<"String: "<<s<<"\t\tToken: "<<tok<<endl;
    }
    return 0;
}

Output:

string: 20#6 5, 3

String: 206 5, 3    Token: 20
String: 2065, 3     Token: 6
String: 2065 3      Token: 5
String: 2065 3      Token: 3

strtok iterates over the string. On the first call it finds the first non-delimiter character (2 in this case) and marks it as the token start, then continues scanning for a delimiter, replaces it with a null character (# gets replaced in the actual string) and returns the start pointer, which points to the first character of the token (i.e., it returns the token 20, which is now null-terminated). On each subsequent call it starts scanning from the next character and returns a token if one is found, else null. Subsequently it returns the tokens 6, 5 and 3.

maximus
  • FYI: strtok will change the value of s. You should not use const_cast, since this simply hides an issue. – orbitcowboy Feb 21 '17 at 20:28
  • This causes undefined behaviour by using the result of `c_str()` to modify the string – M.M Mar 24 '19 at 09:02
  • @M.M added more clarification and working of the strtok function. Hope it will help people understard when to use it – maximus Mar 28 '19 at 19:22