
I am trying to split a comma-separated string and then perform some action on each token, ignoring duplicates, so something along the following lines:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
  string text = "token, test   string";

  char_separator<char> sep(", ");
  tokenizer< char_separator<char> > tokens(text, sep);
  // remove duplicates from tokens?
  BOOST_FOREACH (const string& t, tokens) {
    cout << t << "." << endl;
  }
}

Is there a way to do this with boost::tokenizer?

I know that I can solve this problem using boost::split and std::unique, but was wondering whether there is a way to achieve this with the tokenizer as well.
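For reference, this is roughly what the boost::split version looks like (just an untested sketch; note the extra std::sort, since std::unique only removes adjacent duplicates):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <boost/foreach.hpp>

int main()
{
  std::string text = "token, test   string";

  // split on ',' and ' ', collapsing runs of separators
  std::vector<std::string> parts;
  boost::split(parts, text, boost::is_any_of(", "), boost::token_compress_on);

  // std::unique only removes adjacent duplicates, so sort first
  std::sort(parts.begin(), parts.end());
  parts.erase(std::unique(parts.begin(), parts.end()), parts.end());

  BOOST_FOREACH (const std::string& t, parts) {
    std::cout << t << "." << std::endl;
  }
}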

tt293
  • `std::unique` only works on sorted ranges; is your input always sorted? (If not, are you interested in filtering all duplicates, or just adjacent elements that are identical to each other?) – Mankarse Nov 23 '12 at 12:19
  • I'm pretty sure the answer is no -- tokenizer doesn't keep track of previous tokens, so it has no way to know whether the current token is new or duplicates a previous one. – Jerry Coffin Nov 23 '12 at 12:47
  • @Mankarse: you are right, there is an additional call to std::sort that I make in the boost::split case. – tt293 Nov 23 '12 at 15:01

1 Answer


boost.tokenizer can do many cool things, but it cannot do this; the answer is indeed "no".

If you're only looking to drop adjacent duplicates, boost.range can help make it seamless:

#include <iostream>
#include <string>
#include <boost/range/adaptor/uniqued.hpp>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace boost;
using namespace boost::adaptors;
int main()
{
    std::string text = "token, test   string test, test   test";

    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);
    BOOST_FOREACH (const std::string& t, tokens | uniqued ) {
        std::cout << t << "." << '\n';
    }
}

This prints:

token.
test.
string.
test.

In order to do some action only on globally unique tokens, you will need to store state, one way or another. The simplest solution is probably an intermediate set:

char_separator<char> sep(", ");
tokenizer< char_separator<char> > tokens(text, sep);
std::set<std::string> unique_tokens(tokens.begin(), tokens.end()); // needs <set>
BOOST_FOREACH (const std::string& t, unique_tokens) {
    std::cout << t << "." << '\n';
}
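
Note that a std::set hands the tokens back in sorted order. If the original order matters, a rough alternative sketch is to filter on the fly with a helper set (the seen variable is just illustrative), which prints each token once, in the order of its first appearance:

#include <iostream>
#include <set>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace boost;

int main()
{
    std::string text = "token, test   string test, test   test";

    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);

    std::set<std::string> seen;
    BOOST_FOREACH (const std::string& t, tokens) {
        // insert().second is true only the first time a token is seen
        if (seen.insert(t).second) {
            std::cout << t << "." << '\n';
        }
    }
}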
Cubbi