5

I have a std::wstring variable that contains some text, and I need to split it on a separator. How can I do this? I would rather not use Boost, because it generates some warnings. Thank you.

EDIT 1: this is an example text:

hi how are you?

and this is the code:

typedef boost::tokenizer<boost::char_separator<wchar_t>, std::wstring::const_iterator, std::wstring> Tok;

boost::char_separator<wchar_t> sep;

Tok tok(this->m_inputText, sep);

for(Tok::iterator tok_iter = tok.begin(); tok_iter != tok.end(); ++tok_iter)
{
    wcout << *tok_iter << endl;   // wide stream: the tokens are std::wstring
}

the results are:

  1. hi
  2. how
  3. are
  4. you
  5. ?

I don't understand why the last character is always split off into a separate token...

Dmitry
Stefano
    possible duplicate of [How do I tokenize a string in C++?](http://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c), a number of methods are covered both with and without boost – Ben Voigt Mar 24 '11 at 20:30

4 Answers

7

In your code, the question mark appears as a separate token because that's how boost::tokenizer works with a default-constructed char_separator.

If your desired output is four tokens ("hi", "how", "are", and "you?"), you could

a) change the char_separator you're using to

boost::char_separator<wchar_t> sep(L" ", L"");

b) use boost::split which, I think, is the most direct answer to "split a wstring by specified character"

#include <string>
#include <iostream>
#include <vector>
#include <boost/algorithm/string.hpp>

int main()
{

        std::wstring m_inputText = L"hi how are you?";

        std::vector<std::wstring> tok;
        boost::split(tok, m_inputText, boost::is_any_of(L" "));

        for(std::vector<std::wstring>::iterator tok_iter = tok.begin();
                        tok_iter != tok.end(); ++tok_iter)
        {
                std::wcout << *tok_iter << '\n';
        }

}

test run: https://ideone.com/jOeH9
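Since the question asks for something without Boost, here is a minimal sketch of the same idea using only std::wstring::find. The function name split_on is made up for illustration. It keeps empty tokens between adjacent separators, which, as far as I know, is also what boost::split does with its default settings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of any library): splits text on a single
// separator character. Adjacent separators produce empty tokens.
std::vector<std::wstring> split_on(const std::wstring& text, wchar_t sep)
{
    std::vector<std::wstring> tokens;
    std::wstring::size_type start = 0;
    for (;;)
    {
        const std::wstring::size_type end = text.find(sep, start);
        if (end == std::wstring::npos)
        {
            // No more separators: the rest of the string is the last token.
            tokens.push_back(text.substr(start));
            break;
        }
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;  // continue just past the separator
    }
    return tokens;
}
```

Calling split_on(L"hi how are you?", L' ') should give the four tokens "hi", "how", "are" and "you?".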

Cubbi
2

You're default constructing boost::char_separator. The documentation says:

The function std::isspace() is used to identify dropped delimiters and std::ispunct() is used to identify kept delimiters. In addition, empty tokens are dropped.

Since std::ispunct(L'?') is true, it is treated as a "kept" delimiter, and reported as a separate token.
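To make the rule concrete, here is a rough emulation of that documented default behaviour using only the standard library. This is not Boost's actual implementation, just a sketch of the rule quoted above; the function name tokenize_like_default is made up, and it uses the wide-character classifiers std::iswspace and std::iswpunct as an assumption about how the rule applies to wchar_t.

```cpp
#include <cwctype>
#include <string>
#include <vector>

// Hypothetical emulation of the documented default rule:
//   whitespace  -> dropped delimiter (ends a token, not reported)
//   punctuation -> kept delimiter (ends a token, reported as its own token)
//   empty tokens are dropped
std::vector<std::wstring> tokenize_like_default(const std::wstring& text)
{
    std::vector<std::wstring> tokens;
    std::wstring current;
    for (wchar_t ch : text)
    {
        const bool is_space = std::iswspace(ch) != 0;
        const bool is_punct = std::iswpunct(ch) != 0;
        if (is_space || is_punct)
        {
            if (!current.empty())      // finish the token in progress
            {
                tokens.push_back(current);
                current.clear();
            }
            if (is_punct)              // kept delimiter becomes a token
                tokens.push_back(std::wstring(1, ch));
        }
        else
        {
            current += ch;
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}
```

For L"hi how are you?" this yields five tokens, with "?" last, which matches the output in the question.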

Ben Voigt
1

Hi, you can use the wcstok function.
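A minimal sketch of how that might look, assuming the standard three-argument std::wcstok from <cwchar> (some older compilers shipped a two-argument variant, so check yours). The wrapper name split_with_wcstok is made up for illustration. Because wcstok writes into the string it scans, the sketch tokenizes a copy rather than the std::wstring itself.

```cpp
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical wrapper: tokenizes a copy of text with std::wcstok.
// delims is a null-terminated list of separator characters.
std::vector<std::wstring> split_with_wcstok(const std::wstring& text,
                                            const wchar_t* delims)
{
    // wcstok modifies its input, so work on a writable, terminated copy.
    std::vector<wchar_t> buffer(text.begin(), text.end());
    buffer.push_back(L'\0');

    std::vector<std::wstring> tokens;
    wchar_t* state = nullptr;  // scan position kept between calls
    for (wchar_t* tok = std::wcstok(buffer.data(), delims, &state);
         tok != nullptr;
         tok = std::wcstok(nullptr, delims, &state))
    {
        tokens.push_back(std::wstring(tok));
    }
    return tokens;
}
```

Note that wcstok treats a run of separators as one, so it never returns empty tokens.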

Sanja Melnichuk
1

You said you don't want boost so...

This may be a weird approach to use in C++, but I used it once in a MUD where I needed a lot of tokenization in C.

Take this block of memory, held in the char array chars:

char chars[] = "I like to fiddle with memory";

If you need to tokenize on a space character:

create an array of char* called splitvalues, big enough to store all tokens
for each character, advancing the pointer chars until it reaches '\0':
  if splitvalues[counter] is not already set, set it to the current address
  if the value is ' ':
     write 0 there
     increment counter
if the last token was started but not closed by a ' ', increment counter

When you finish, the original string is destroyed, so do not use it; instead you have the array of pointers to the tokens. The number of tokens is the counter variable (the upper bound of the array).

The approach is this:

  • iterate over the string and, on the first character of each token, update the token start pointer
  • replace each character you need to split on with a zero, which means string termination in C
  • count how many times you did this

PS. Not sure if you can use a similar approach in a Unicode environment, though.
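For what it's worth, the same trick can be sketched on a wchar_t buffer, which is what a std::wstring holds. This is only an illustration under assumptions: the function name split_in_place is made up, the buffer must be writable and null-terminated, and runs of separators are skipped rather than reported as empty tokens.

```cpp
#include <vector>

// Hypothetical helper: splits a writable, null-terminated buffer in place.
// Every separator is overwritten with L'\0', and the returned pointers
// point at the start of each token inside the same buffer.
std::vector<wchar_t*> split_in_place(wchar_t* buffer, wchar_t sep)
{
    std::vector<wchar_t*> tokens;
    bool in_token = false;
    for (wchar_t* p = buffer; *p != L'\0'; ++p)
    {
        if (*p == sep)
        {
            *p = L'\0';        // terminate the token that just ended
            in_token = false;
        }
        else if (!in_token)
        {
            tokens.push_back(p);  // first character of a new token
            in_token = true;
        }
    }
    return tokens;
}
```

The pointers stay valid only as long as the buffer does, so to use this with a std::wstring you would first copy its contents into a writable array or a std::vector<wchar_t> with a trailing L'\0'.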

Marino Šimić