The problem below has been simplified from real requirements.

Consider the following program:

#include <iostream>
#include <iterator>
#include <string>
#include <set>
#include <algorithm>

using namespace std;

typedef string T; // to simplify, always consider T as string

template<typename input_iterator>
void do_something(const input_iterator& first, const input_iterator& last) {
    const ostream_iterator<T> os(cout, "\n");
    const set<T> words(first, last);
    copy(words.begin(), words.end(), os);
}

int main(int argc, char** argv) {
    const istream_iterator<T> is(cin), eof;
    do_something(is, eof);
    return 0;
}

The program extracts all the words from an istream (cin) and does something with them. By default, words are separated by whitespace. The logic behind the formatted extraction lives inside istream_iterator.

What I need to do now is pass two iterators to do_something() so that the extracted words are separated by a punctuation character instead of whitespace (whitespace will be treated as "normal" characters). How would you do that in a "clean C++ way" (that is, with minimum effort)?

Martin

1 Answer

Although it isn't obvious a priori, there is a relatively simple way to change what a stream considers to be whitespace: imbue() the stream with a std::locale object whose std::ctype<char> facet is replaced so that it treats the desired characters as whitespace. imbue(), locale, ctype - huh?!? OK, well, these aren't necessarily things you use day to day, so here is a quick example which sets up std::cin to use comma and newline characters as spaces:

#include <locale>

// Base class that owns the table; it must be constructed before std::ctype<char>.
template <char S0, char S1>
struct commactype_base {
    commactype_base(): table_() {  // table_() zero-initializes every mask
        this->table_[static_cast<unsigned char>(S0)] = std::ctype_base::space;
        this->table_[static_cast<unsigned char>(S1)] = std::ctype_base::space;
    }
    std::ctype<char>::mask table_[std::ctype<char>::table_size];
};

// Facet that classifies only S0 and S1 as whitespace.
template <char S0, char S1 = S0>
struct ctype:
    commactype_base<S0, S1>,
    std::ctype<char>
{
    ctype(): std::ctype<char>(this->table_, false) {}  // false: do not delete the table
};

This particular implementation of std::ctype<char> can be used to treat one or two arbitrary chars as spaces (a proper C++2011 version would probably allow an arbitrary number of arguments; also, they don't really have to be template arguments). Anyway, with this in place, just drop the following line at the beginning of your main() function and you are all set:

std::cin.imbue(std::locale(std::locale(), new ::ctype<',', '\n'>));
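As noted, the separators don't have to be template arguments. Here is a minimal sketch of a variant that takes them at run time as a string, so any number of separators can be used. The names separator_table, separator_ctype and split_words are hypothetical and only serve the illustration; the sketch reads from a std::istringstream rather than std::cin so it can be tried in isolation.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical names throughout; none of these come from the code above.

// Holds the classification table. It is a separate base class so that the
// table is constructed before the std::ctype<char> base that points to it.
struct separator_table {
    explicit separator_table(const std::string& separators) : table_() {
        // table_() zero-initializes every mask; only the separators get the space bit.
        for (std::string::size_type i = 0; i != separators.size(); ++i) {
            table_[static_cast<unsigned char>(separators[i])] = std::ctype_base::space;
        }
    }
    std::ctype_base::mask table_[std::ctype<char>::table_size];
};

struct separator_ctype : separator_table, std::ctype<char> {
    explicit separator_ctype(const std::string& separators)
        : separator_table(separators),
          std::ctype<char>(this->table_, false) {}  // false: facet does not own the table
};

// Splits `text` into words, using only the characters in `separators` as delimiters.
std::vector<std::string> split_words(const std::string& text,
                                     const std::string& separators) {
    std::istringstream in(text);
    // The locale takes ownership of the facet and deletes it when it is no longer used.
    in.imbue(std::locale(in.getloc(), new separator_ctype(separators)));
    std::istream_iterator<std::string> first(in), last;
    return std::vector<std::string>(first, last);
}
```

For the program in the question, the same facet could be installed on std::cin with std::cin.imbue(std::locale(std::locale(), new separator_ctype(",\n"))); before the istream_iterators are created.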

Note that this really only considers , and \n as space characters, so no other characters are skipped as whitespace. And, of course, a sequence of multiple comma characters is considered to be just one separator rather than producing a bunch of empty strings. Also note that the above std::ctype<char> facet removes all other character classification. If you want to parse objects other than strings, you might want to retain the other character classifications and only change the one for spaces. Here is a way this could be done:

#include <algorithm>

template <char S0, char S1>
struct commactype_base {
    commactype_base(): table_() {
        // Copy the classic table, clearing the space bit of every entry.
        std::transform(std::ctype<char>::classic_table(),
                       std::ctype<char>::classic_table() + std::ctype<char>::table_size,
                       this->table_,
                       [](std::ctype_base::mask m) -> std::ctype_base::mask {
                           return m & ~(std::ctype_base::space);
                       });
        // Mark only the chosen separators as whitespace.
        this->table_[static_cast<unsigned char>(S0)] |= std::ctype_base::space;
        this->table_[static_cast<unsigned char>(S1)] |= std::ctype_base::space;
    }
    std::ctype<char>::mask table_[std::ctype<char>::table_size];
};

Sadly, this crashes with the version of gcc I have on my system (apparently std::ctype<char>::classic_table() yields a null pointer). Compiling this with the current version of clang doesn't work either, because it doesn't support lambdas. With these two caveats, the above code should be correct, though...
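If lambdas are not available, the same table setup can be written as a plain loop. The following is only a sketch: the names keep_classes_table, keep_classes_ctype and comma_newline_locale are hypothetical, and it assumes std::ctype<char>::classic_table() returns a valid table on your implementation. If the pointer is null, the sketch skips the copy and falls back to an empty table, so only the two separators are classified at all.

```cpp
#include <cstddef>
#include <locale>

// Hypothetical names; a loop-based variant of the table setup above.
template <char S0, char S1>
struct keep_classes_table {
    keep_classes_table() : table_() {
        // Assumption: classic_table() returns a valid table on this implementation.
        const std::ctype_base::mask* classic = std::ctype<char>::classic_table();
        if (classic) {
            for (std::size_t i = 0; i != std::ctype<char>::table_size; ++i) {
                // Copy each classic mask with the space bit cleared.
                table_[i] = static_cast<std::ctype_base::mask>(
                    classic[i] & ~std::ctype_base::space);
            }
        }
        // Mark only the chosen separators as whitespace.
        table_[static_cast<unsigned char>(S0)] |= std::ctype_base::space;
        table_[static_cast<unsigned char>(S1)] |= std::ctype_base::space;
    }
    std::ctype_base::mask table_[std::ctype<char>::table_size];
};

template <char S0, char S1 = S0>
struct keep_classes_ctype : keep_classes_table<S0, S1>, std::ctype<char> {
    keep_classes_ctype() : std::ctype<char>(this->table_, false) {}
};

// Builds a locale in which only ',' and '\n' count as whitespace,
// while digit, alpha, and the other classifications stay as in the "C" locale.
std::locale comma_newline_locale() {
    return std::locale(std::locale::classic(), new keep_classes_ctype<',', '\n'>);
}
```

The resulting locale can be passed to std::cin.imbue() in the same way as before, and the two-argument std::isspace(c, loc) and std::isdigit(c, loc) from <locale> are a quick way to check what the facet reports for a given character.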

Dietmar Kühl
  • I'm not seeing where the rest of the table is filled in with the default values... wouldn't this break all non-space character types? – Ben Voigt Jan 27 '12 at 22:35
  • @Ben Voigt: I like my code to be correct but fortunately it is: the subtle key is the `: table_()`. – Dietmar Kühl Jan 27 '12 at 22:39
  • So `ctype::mask::mask()` fills in the table with the default type masks? Neat, but then the usual whitespace characters are still tagged as spaces, since you haven't changed them. – Ben Voigt Jan 27 '12 at 22:41
  • Wait, I'm not convinced. `ctype::mask` is a typedef for `char`, so everything is value-initialized to zero. That clears the "spaciness" of the usual whitespace characters, but it also breaks all other ctype categories. No characters will test as `upper`, `lower`, `digit`, `xdigit`, and so on. – Ben Voigt Jan 27 '12 at 22:46
  • ??? The default mask is empty. I create a table of empty masks and set two entries in the table up to be considered spaces. Using this particular `std::ctype` facet for something else than changing the meaning of what spaces are is bound not to be too successful but I didn't set out to do this. – Dietmar Kühl Jan 27 '12 at 22:46
  • So formatted input doesn't break when the facet has no characters in the `digit` category? – Ben Voigt Jan 27 '12 at 22:50
  • I didn't see your second comment: yes, this is correct. This facet won't do any character classification other than the classification for space. If you wanted to retain other character classification you would need to use a somewhat more involved approach to set up the table. I can show this as well if you want but it just amounts to 1. copy the table from the base, 2. clear the space bit, 3. set the space bit for the characters you want to have it set for. – Dietmar Kühl Jan 27 '12 at 22:51
  • Whether `digit` is needed depends on whether numbers are read. The example read `std::string`s. – Dietmar Kühl Jan 27 '12 at 22:52
  • In combination with http://stackoverflow.com/questions/5607589/right-way-to-split-an-stdstring-into-a-vectorstring, this solved my problem perfectly. – Samveen Apr 01 '13 at 11:56