Fast String tokenization in C/C++

Question

I'm working on a C/C++ app (in Visual Studio 2010) where I need to tokenize a comma delimited string and I would like this to be as fast as possible. Currently I'm using strtok_s. I ran some tests of strtok_s versus sscanf and it seemed like strtok_s was faster (unless I wrote a terrible implementation :) ) but I was wondering if anyone could suggest a faster alternative.

In C, I don't think you can find a faster alternative to `strtok()` (or even `strtok_s()` -- whatever that is). If `strtok()` suits you, use it. — pmg, Apr 14 '11 at 15:53
Nitpick: There's no such language called "C/C++." Is it C or C++? They are different. — John Dibling, Apr 14 '11 at 15:59
Thanks for the fast answers guys, initially I had the same train of thought as pmg. To be thorough however I am going to prototype these answers and then time them and I will post the results for future reference. — mmhmm, Apr 14 '11 at 16:00
The best practices for many things are different between C and C++. If you're working in one, you want the best practice for that language. If you're working in both, you probably want two different best practices. — David Thornley, Apr 14 '11 at 16:51
@David ya exactly, I was basically trying to ask what these are and which is the fastest. I am just curious to compare them since both options are available to me and the bottom line is I need it to be as fast as possible. I will post results later thx :) — mmhmm, Apr 14 '11 at 18:01

ildjarn · Answer 1 · 2012-03-21T03:53:30.137

6

For sheer runtime speed, boost.spirit.qi is an excellent candidate.

edited Mar 21 '12 at 03:53

answered Apr 14 '11 at 15:40

ildjarn

62,044
9
127
211

score 5 · Answer 2 · answered Apr 14 '11 at 15:48

The best you can do is to make sure you go through the string only once, and build the output on the fly. Start pulling off chars in to a temp buffer, and when you encounter the delimiter save the temp buffer to the output collection, clear the temp buffer, rinse and repeat.

Here's a basic implementation that does this.

template<class C=char>
struct basic_token
{
    typedef std::basic_string<C> token_string;
    typedef unsigned long size_type;
    token_string token_, delim_;
    basic_token(const token_string& token, const token_string& delim = token_string());
};

template<class C>
basic_token<C>::basic_token(const token_string& token, const token_string& delim)
:   token_(token),
    delim_(delim)
{
}

typedef basic_token<char> token;

template<class Char, class Iter> void tokenize(const std::basic_string<Char>& line, const Char* delims, Iter itx)
{
    typedef basic_token<Char> Token;
    typedef std::basic_string<Char> TString;

    for( TString::size_type tok_begin = 0, tok_end = line.find_first_of(delims, tok_begin);
        tok_begin != TString::npos; tok_end = line.find_first_of(delims, tok_begin) )
    {
        if( tok_end == TString::npos )
        {
            (*itx++) = Token(TString(&line[tok_begin]));
            tok_begin = tok_end;
        }
        else
        {
            (*itx++) = Token(TString(&line[tok_begin], &line[tok_end]), TString(1, line[tok_end]));
            tok_begin = tok_end + 1;
        }
    }
}

template<class Char, class Iter> void tokenize(const Char* line, const Char* delim, Iter itx)
{
    tokenize(std::basic_string<Char>(line), delim, itx);
}
template<class Stream, class Token> Stream& operator<<(Stream& os, const Token& tok)
{
    os << tok.token_ << "\t[" << tok.delim_ << "]";
    return os;
}

...which you would use like this:

string raw = "35=BW|49=TEST|1346=REQ22|1355=2|1182=88500|1183=88505|10=087^";
vector<stoken> tokens;
tokenize(raw, "|", back_inserter(tokens));
copy(tokens.begin(), tokens.end(), ostream_iterator<stoken>(cout, "\n"));

Output is:

35=BW   [|]
49=TEST [|]
1346=REQ22      [|]
1355=2  [|]
1182=88500      [|]
1183=88505      [|]
10=087^ []

StereoMatching · Answer 3 · 2016-04-06T02:16:53.363

The test of mmhmm haven't make use of spirit correctly, his grammars are flaw.

#include <cstdio> 
#include <cstring>   

#include <iostream>
#include <string>

#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/io.hpp>

#include <boost/spirit/include/qi.hpp>    

/****************************strtok_r************************/
typedef struct sTokenDataC {
    char *time;
    char *symb;
    float bid;
    float ask;
    int bidSize;
    int askSize;
} tokenDataC;

tokenDataC parseTick( char *line, char *parseBuffer )
{
    tokenDataC tokenDataOut;

    tokenDataOut.time = strtok_r( line,",", &parseBuffer );
    tokenDataOut.symb = strtok_r( nullptr,",", &parseBuffer );
    tokenDataOut.bid = atof(strtok_r( nullptr,",", &parseBuffer ));
    tokenDataOut.ask = atof(strtok_r( nullptr , ",", &parseBuffer ));
    tokenDataOut.bidSize = atoi(strtok_r( nullptr,",", &parseBuffer ));
    tokenDataOut.askSize = atoi(strtok_r( nullptr, ",", &parseBuffer  ));

    return tokenDataOut;
}

void test_strcpy_s(int iteration)
{
    char *testStringC = new char[64];
    char *lineBuffer = new char[64];

    printf("test_strcpy_s....\n");
    strcpy(testStringC,"09:30:00,TEST,13.24,15.32,10,14");
    {
        timeEstimate<> es;
        tokenDataC tokenData2;
        for(int i = 0; i < iteration; i++)
        {
            strcpy(lineBuffer, testStringC);//this is more realistic since this has to happen because I want to preserve the line
            tokenData2 = parseTick(lineBuffer, testStringC);
            //std::cout<<*tokenData2.time<<", "<<*tokenData2.symb<<",";
            //std::cout<<tokenData2.bid<<", "<<tokenData2.ask<<", "<<tokenData2.bidSize<<", "<<tokenData2.askSize<<std::endl;
        }
    }

    delete[] lineBuffer;
    delete[] testStringC;
}
/****************************strtok_r************************/

/****************************spirit::qi*********************/
namespace qi = boost::spirit::qi;

struct tokenDataCPP
{
    std::string time;
    std::string symb;
    float bid;
    float ask;
    int bidSize;
    int askSize;

    void clearTimeSymb(){
        time.clear();
        symb.clear();
    }
};

BOOST_FUSION_ADAPT_STRUCT(
        tokenDataCPP,
        (std::string, time)
        (std::string, symb)
        (float, bid)
        (float, ask)
        (int, bidSize)
        (int, askSize)
        )

void test_spirit_qi(int iteration)
{
    std::string const strs("09:30:00,TEST,13.24,15.32,10,14");
    tokenDataCPP data;        

    auto myString = *~qi::char_(",");
    auto parser = myString >> "," >> myString >> "," >> qi::float_ >> "," >> qi::float_ >> "," >> qi::int_  >> "," >> qi::int_;
    {
        std::cout<<("test_spirit_qi....\n");
        timeEstimate<> es;
        for(int i = 0; i < iteration; ++i){
            qi::parse(std::begin(strs), std::end(strs), parser, data);
            //std::cout<<data.time<<", "<<data.symb<<", ";
            //std::cout<<data.bid<<", "<<data.ask<<", "<<data.bidSize<<", "<<data.askSize<<std::endl;
            data.clearTimeSymb();
        }
    }
}
/****************************spirit::qi*********************/

int main()
{
    int const ITERATIONS = 500 * 10000;
    test_strcpy_s(ITERATIONS);
    test_spirit_qi(ITERATIONS);
}

Since clang++ don't have strtok_s, I use strtok_r to replace it Iterate 500 * 10k, the times are

test_strcpy_s : 1.40951
test_spirit_qi : 1.34277

Their times are almost the same, not much different.

compiler, clang++ 3.2, -O2

codes of timeEstime

score 1 · Answer 4 · answered Sep 08 '18 at 11:26

This should be quite fast, no temp buffers, it allocates empty tokes too.

template <class char_t, class char_traits_t, 
        class char_allocator_t, class string_allocator_t>
inline void _tokenize(
    const std::basic_string<char_t, char_traits_t, char_allocator_t>& _Str, 
    const char_t& _Tok, 
    std::vector<std::basic_string<char_t, char_traits_t, char_allocator_t>, 
        string_allocator_t>& _Tokens, 
    const size_t& _HintSz=10)
{
    _Tokens.reserve(_HintSz);

    const char_t* _Beg(&_Str[0]), *_End(&_Str[_Str.size()]); 

    for (const char_t* _Ptr=_Beg; _Ptr<_End; ++_Ptr)
    {
        if (*_Ptr == _Tok)
        {
            _Tokens.push_back(
                std::basic_string<char_t, char_traits_t,
                    char_allocator_t>(_Beg, _Ptr));

            _Beg = 1+_Ptr;
        }
    }

    _Tokens.push_back(
        std::basic_string<char_t, char_traits_t,
            char_allocator_t>(_Beg, _End));
}

score 1 · Answer 5 · answered Apr 14 '11 at 16:48

1

I would remind you that there is a risk with strtok and its ilk that you can get back a different number of tokens than you might want.

one|two|three  would yield 3 tokens

while

one|||three    would yield 2.

answered Apr 14 '11 at 16:48

EvilTeach

28,120
21
85
141

score -1 · Answer 6 · answered Apr 15 '11 at 16:38

After testing and timing each suggested candidate, the result is that strtok is clearly the fastest. Although I'm not surprised my love of testing dictated it was worth exploring other options. [Note: Code was thrown together edits welcome :) ]

Given:

typedef struct sTokenDataC {
    char *time;
    char *symb; 
    float bid;
    float ask;
    int bidSize;
    int askSize;
} tokenDataC;

tokenDataC parseTick( char *line, char *parseBuffer )
{
    tokenDataC tokenDataOut;

    tokenDataOut.time = strtok_s( line,",", &parseBuffer );
    tokenDataOut.symb = strtok_s( null,",", &parseBuffer );
    tokenDataOut.bid = atof(strtok_s( null,",", &parseBuffer ));
    tokenDataOut.ask = atof(strtok_s( null , ",", &parseBuffer ));
    tokenDataOut.bidSize = atoi(strtok_s( null,",", &parseBuffer ));
    tokenDataOut.askSize = atoi(strtok_s( null, ",", &parseBuffer  ));

    return tokenDataOut; 
}

char *testStringC = new char[64];
    strcpy(testStringC,"09:30:00,TEST,13.24,15.32,10,14");

int _tmain(int argc, _TCHAR* argv[])
{
char *lineBuffer = new char[64];
    printf("Testing method2....\n");
    for(int i = 0; i < ITERATIONS; i++)
    {
        strcpy(lineBuffer,testStringC);//this is more realistic since this has to happen because I want to preserve the line
        tokenData2 = parseTick(lineBuffer,parseBuffer);

    }
}

vs calling John Diblings impl via:

    struct sTokenDataCPP
    {
        std::basic_string<char> time;
        std::basic_string<char> symb; 
        float bid;
        float ask;
        int bidSize;
        int askSize;
    };
        std::vector<myToken> tokens1;
            tokenDataCPP tokenData;
            printf("Testing method1....\n");
            for(int i = 0; i < ITERATIONS; i++)
            {
tokens1.clear();
                tokenize(raw, ",", std::back_inserter(tokens1));

                tokenData.time.assign(tokens1.at(0).token_);
                tokenData.symb.assign(tokens1.at(1).token_);
                tokenData.ask = atof(tokens1.at(2).token_.c_str());
                tokenData.bid = atof(tokens1.at(3).token_.c_str());
                tokenData.askSize = atoi(tokens1.at(4).token_.c_str());
                tokenData.bidSize = atoi(tokens1.at(5).token_.c_str());

            }

vs a simple boost.spirit.qi implementation defining the grammer as follows:

template <typename Iterator>
struct tick_parser : grammar<Iterator, tokenDataCPP(), boost::spirit::ascii::space_type>
{

    tick_parser() : tick_parser::base_type(start)
    {
        my_string %= lexeme[+(boost::spirit::ascii::char_ ) ];

        start %=
            my_string >> ','
            >>  my_string >> ','
            >>  float_ >> ','
            >>  float_ >> ','
            >>  int_ >> ','
            >>  int_
            ;
    }

    rule<Iterator, std::string(), boost::spirit::ascii::space_type> my_string;
    rule<Iterator, sTokenDataCPP(), boost::spirit::ascii::space_type> start;
};

with ITERATIONS set to 500k: strtok version: 2s john's version: 115s boost: 172s

I can post the full code is people want this, I just didn't want to take up a huge amt of space

Roll your own version of `strtok()` based on `strchr()` for an even faster version. You can make the prototype simpler: `char *strtokchr(char *src, int ch);`; you don't need to code for several "break" characters; maybe you can deal with successive "break" characters differently ... — pmg, Apr 15 '11 at 17:34
@pmg wrote a function based on strchr that wrote directly to my target struct and was able to do 1 mill iterations in 1 sec. awesome thx :) — mmhmm, Apr 15 '11 at 19:56
YAY! Nice. But don't name that function like I did in my previous comment. That identifier starts with `str` and a lowercase letter so it's reserved for the implementation. Try `chrstrtok` :) — pmg, Apr 15 '11 at 20:24
Your codes can't run, especially tye grammar of spirit, it is impossible to compile.1 : namespace error--you didn't add namespace before your rules; 2 : the synthesis attribute of the struct and the "start" rule should be the same type. 3 : the grammar of the spirit is not doing the same thing as strtok_s. 4 : your grammar is buggy, it would parse all of the chars but not the pattern you want.Conclusion : test your results carefully before you post your results — StereoMatching, Oct 24 '13 at 03:41

Fast String tokenization in C/C++

6 Answers6