
So I have the following string of data, which is being received through a TCP Winsock connection, and I would like to do an advanced tokenization into a vector of structs, where each struct represents one record.

std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";

struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};

Each record in the string is delimited by a newline ("\n"). Here is my attempt at splitting up the records, though not yet splitting up the fields:

void tokenize(std::string& str, std::vector<std::string>& records)
{
    // Skip delimiters at beginning.
    std::string::size_type lastPos = str.find_first_not_of("\n", 0);
    // Find first "non-delimiter".
    std::string::size_type pos     = str.find_first_of("\n", lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        records.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of("\n", pos);
        // Find next "non-delimiter"
        pos = str.find_first_of("\n", lastPos);
    }
}

It seems totally unnecessary to repeat all of that code again to further tokenize each record via the colon (internal field separator) into the struct and push each struct into a vector. I'm sure there is a better way of doing this, or perhaps the design is in itself wrong.
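One way to factor out the duplication, as a rough sketch: make the delimiter set a parameter of the splitter, then call the same function once with `"\n"` for records and once with `":"` for fields. (This is an illustrative sketch, not from the original post; it assumes the `table_t` above with all five members declared as `std::string`.)

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};

// Same find_first_not_of / find_first_of logic as above, but with the
// delimiter set as a parameter so one function serves both levels.
std::vector<std::string> split(const std::string& str, const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string::size_type lastPos = str.find_first_not_of(delims, 0);
    std::string::size_type pos = str.find_first_of(delims, lastPos);
    while (std::string::npos != pos || std::string::npos != lastPos)
    {
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        lastPos = str.find_first_not_of(delims, pos);
        pos = str.find_first_of(delims, lastPos);
    }
    return tokens;
}

// Split into records on '\n', then each record into fields on ':'.
std::vector<table_t> parse_records(const std::string& buf)
{
    std::vector<table_t> records;
    std::vector<std::string> lines = split(buf, "\n");
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        std::vector<std::string> fields = split(lines[i], ":");
        if (fields.size() != 5)
            continue;  // skip malformed records
        table_t t;
        t.key        = fields[0];
        t.first      = fields[1];
        t.last       = fields[2];
        t.rank       = fields[3];
        t.additional = fields[4];
        records.push_back(t);
    }
    return records;
}
```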

Thank you for any help.

rem45acp
    If you can use boost, this would be rather neatly done with its tokenizer library, its string algorithm library, or, for the most robust solution, with `boost.spirit`, as in here: http://www.boost.org/doc/libs/1_46_1/libs/spirit/doc/html/spirit/qi/tutorials/employee___parsing_into_structs.html – Cubbi Mar 28 '11 at 16:32
  • use [boost::tokenizer](http://www.boost.org/doc/libs/1_46_1/libs/tokenizer/index.html) – user237419 Mar 28 '11 at 16:38
  • missed this comment. +1 for spirit altho that is too heavy for the data format used in this case – user237419 Mar 28 '11 at 16:40

2 Answers


My solution:

// A ctype facet that classifies ':' as whitespace, so stream extraction
// splits on it (requires <locale> and <cstring> for std::memcpy).
struct colon_separated_only: std::ctype<char>
{
    colon_separated_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        typedef std::ctype<char> cctype;
        static const cctype::mask *const_rc= cctype::classic_table();

        static cctype::mask rc[cctype::table_size];
        std::memcpy(rc, const_rc, cctype::table_size * sizeof(cctype::mask));

        rc[':'] = std::ctype_base::space; 
        return &rc[0];
    }
};

struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;
};

int main() {
        std::string buf = "44:william:adama:commander:stuff\n33:luara:roslin:president:data\n";
        std::stringstream s(buf);
        s.imbue(std::locale(std::locale(), new colon_separated_only()));
        table_t t;
        std::vector<table_t> data;
        while ( s >> t.key >> t.first >> t.last >> t.rank >> t.additional )
        {
           data.push_back(t);
        }
        for(size_t i = 0 ; i < data.size() ; ++i )
        {
           std::cout << data[i].key << " ";
           std::cout << data[i].first << " " << data[i].last << " ";
           std::cout << data[i].rank << " " << data[i].additional << std::endl;
        }
        return 0;
}

Output:

44 william adama commander stuff
33 luara roslin president data

Online Demo : http://ideone.com/JwZuk


The technique I used here is described in my answer to a different question:

Elegant ways to count the frequency of words in a file

Nawaz

For breaking the string up into records, I'd use istringstream, if only because that will simplify the changes later when I want to read from a file. For tokenizing, the most obvious solution is boost::regex, so:

std::vector<table_t> parse( std::istream& input )
{
    std::vector<table_t> retval;
    std::string line;
    while ( std::getline( input, line ) ) {
        static boost::regex const pattern(
            "\([^:]*\):\([^:]*\):\([^:]*\):\([^:]*\):\([^:]*\)" );
        boost::smatch matched;
        if ( !regex_match( line, matched, pattern ) ) {
            //  Error handling...
        } else {
            retval.push_back(
                table_t( matched[1], matched[2], matched[3],
                         matched[4], matched[5] ) );
        }
    }
    return retval;
}

(I've assumed the logical constructor for table_t. Also: there's a very long tradition in C that names ending in _t are typedefs, so you're probably better off finding some other convention.)
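For reference, a minimal sketch of the constructor this answer assumes; the original leaves it implicit. (The `boost::smatch` sub-matches convert to `std::string`, so a constructor taking five strings is all that's needed.)

```cpp
#include <string>

// Sketch of the "logical constructor" assumed above (not in the original post).
struct table_t
{
    std::string key;
    std::string first;
    std::string last;
    std::string rank;
    std::string additional;

    table_t(const std::string& k, const std::string& f,
            const std::string& l, const std::string& r,
            const std::string& a)
        : key(k), first(f), last(l), rank(r), additional(a) {}
};
```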

James Kanze
  • you should ping Siek and tell him ::tokenizer is useless since with regex you can do anything. obviously – user237419 Mar 28 '11 at 16:46
  • @adirau He asked how to avoid duplication. Using an existing tool is the obvious solution. In this case, it's also clearly the simplest solution (at least if you want to check for errors). – James Kanze Mar 28 '11 at 17:31
  • avoid duplication by code reuse; can't say that if you use getline as the first tokenizer and regex as the second tokenizer you're avoiding duplication ;) not the simplest, not the obvious and not even if you want to check for errors; that regex will accept errors at token level; if he needs error checking maybe ::spirit is a better solution as Cubbi mentioned in the first comment – user237419 Mar 28 '11 at 17:43
  • boost::spirit is overkill for something this simple. (In fact, for just about everything I've seen where it isn't overkill, an external parser generator would be better.) The regular expression solution is simple and easily understood. (I considered other solutions, but just posted the simplest.) – James Kanze Mar 28 '11 at 17:47
  • ::spirit is in fact fast but you have a point it doesn't apply here – user237419 Mar 28 '11 at 18:13
  • @adirau The resulting parser may be fast, but the compile times sure aren't:-). Generating a parser is better handled by an external code generator than with template meta-programming. – James Kanze Mar 29 '11 at 09:15
  • I had no idea that what I was looking for would be so complicated! This works fine. I discovered I could also make things simpler by not separating each record with a line break, and just keep a note of how many fields should be expected, by reversing the struct and putting vectors inside it. – rem45acp Mar 29 '11 at 12:03
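The "reversed" layout the asker describes in that last comment might look like the following sketch: one struct holding a vector per field instead of a vector of per-record structs, with record i living at index i of each vector. (The struct and member names here are illustrative, not from the post.)

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Column-oriented variant: one vector per field; record i is the i-th
// element of each vector. All vectors must stay the same length.
struct table_columns_t
{
    std::vector<std::string> key;
    std::vector<std::string> first;
    std::vector<std::string> last;
    std::vector<std::string> rank;
    std::vector<std::string> additional;

    std::size_t size() const { return key.size(); }
};
```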