52

I've redirected cin to read from a file stream with cin.rdbuf(inF.rdbuf()). When I use the extraction operator, it reads until it reaches a whitespace character.

Is it possible to use another delimiter? I went through the API reference on cplusplus.com, but didn't find anything.

Jonathan Mee
yotamoo

4 Answers

53

It is possible to change the inter-word delimiter for cin or any other std::istream, using std::ios_base::imbue to add a custom ctype facet.

If you are reading a file in the style of /etc/passwd, the following program will read each :-delimited word separately.

#include <iostream>
#include <locale>
#include <string>


struct colon_is_space : std::ctype<char> {
  colon_is_space() : std::ctype<char>(get_table()) {}
  static mask const* get_table()
  {
    static mask rc[table_size];
    rc[':'] = std::ctype_base::space;
    rc['\n'] = std::ctype_base::space;
    return &rc[0];
  }
};

int main() {
  using std::string;
  using std::cin;
  using std::locale;

  cin.imbue(locale(cin.getloc(), new colon_is_space));

  string word;
  while(cin >> word) {
    std::cout << word << "\n";
  }
}
Robᵩ
  • 1
    Using `new` in an uncontrolled way is evil, and needless to say you never `delete` your struct (there is no way to delete an unnamed pointer). ALWAYS try `shared_ptr` instead when possible. – Earth Engine Apr 03 '13 at 11:56
  • 32
    That is generally excellent advice which does not apply in this specific case. In this case, `std::locale::facet` is reference-counted, `std::locale::locale` requires a raw pointer, not a shared pointer, and `std::locale::~locale` is defined to `delete` the facet pointer. If you have a problem with the interface to `locale`, take it up with the standards committee, not me. See the example program at http://en.cppreference.com/w/cpp/locale/locale/locale – Robᵩ Apr 03 '13 at 13:20
  • 3
    Even so, I would suggest defining a wrapper function `get_locale` to wrap this unusual use of `new`, with comments, so the code reviewer will realize there is something wrong with the interface, not with the code writer. This is what I mean by a "controlled" way of using `new`. – Earth Engine Apr 04 '13 at 00:02
  • 6
    If not creating new functions, a better way to represent the ownership transfer could be `unique_ptr<colon_is_space>(new colon_is_space).release()`. Although it is basically the same as your code but more verbose, it indicates that you are transferring pointer ownership. – Earth Engine Apr 04 '13 at 01:47
24

For strings, you can use the std::getline overloads to read using a different delimiter.

For number extraction, the delimiter isn't really "whitespace" to begin with, but any character invalid in a number.

Ben Voigt
  • I'm not sure how you can say the delimiter isn't "whitespace" for numbers, if `foo` is an `int`, `istringstream("123 456") >> foo;` puts `123` in `foo`, not `123456`. – Jonathan Mee Jan 28 '15 at 18:19
  • 1
    @JonathanMee: I didn't say that whitespace aren't delimiters, I said the set of delimiters is not only whitespace. Try `istringstream("123_456") >> foo;` or Try `istringstream("123|456") >> foo;` – Ben Voigt Jan 28 '15 at 19:08
  • Ahhh, I understand, you're saying that rather than looking for a character defined as `ctype_base::space` the stream is looking for a character not defined as `ctype_base::digit`. – Jonathan Mee Jan 28 '15 at 19:29
  • 1
    @JonathanMee: Right, although it's more complex than that, some punctuation characters are allowed during numeric parsing. And obviously whether it is classified as a space may affect the status flags, but whitespace is not the only thing that causes numeric extraction to stop. – Ben Voigt Jan 28 '15 at 19:32
  • Does it make sense to expect that `std::getline` is optimized for performance? – Wolf Sep 22 '15 at 11:29
  • 1
    @Wolf streams in general are one of the least performant things in the standard. But typically you're going to use streams with input/output so slow performance will be negligible relative to the cost of the input/output operation. For performance reasons though arrays should be preferred over streams. – Jonathan Mee Feb 09 '16 at 15:08
  • 1
    @JonathanMee: "slow performance will be negligible relative to the cost of the input/output operation" has NEVER been true in my experience. The fact is that in many applications both file I/O and parsing are negligible compared to the cost of other processing, or waiting for the user to hit the start button, or network requests. But in I/O heavy applications built with iostreams, it's the iostream code, not the I/O operations, that dominates. – Ben Voigt Feb 09 '16 at 15:41
  • 1
    Hmmm... I guess it's the type of project that I have a history with. Thanks for the clarification. It's good to have a balancing point of view. I suppose a better answer for @Wolf's question would be: "`getline` is no slower than the stream is as a whole, but if performance is a concern for you, you should look for non-stream options." – Jonathan Mee Feb 09 '16 at 15:58
19

This is an improvement on Robᵩ's answer, because that is the right one (and I'm disappointed that it hasn't been accepted.)

What you need to do is change the array that ctype looks at to decide what a delimiter is.

In the simplest case you could create your own:

const ctype<char>::mask foo[ctype<char>::table_size] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ctype_base::space};

On my machine '\n' is 10. I've set that element of the array to the delimiter value: ctype_base::space. A ctype initialized with foo would only delimit on '\n' not ' ' or '\t'.

Now this is a problem because the array passed into ctype defines more than just what a delimiter is; it also defines letters, numbers, symbols, and some other junk needed for streaming. (Ben Voigt's answer touches on this.) So what we really want to do is modify a mask, not create one from scratch.

That can be accomplished like this:

const auto temp = ctype<char>::classic_table();
vector<ctype<char>::mask> bar(temp, temp + ctype<char>::table_size);

bar[' '] ^= ctype_base::space;
bar['\t'] &= ~(ctype_base::space | ctype_base::cntrl);
bar[':'] |= ctype_base::space;

A ctype initialized with bar would delimit on '\n' and ':' but not ' ' or '\t'.

You go about setting up cin, or any other istream, to use your custom ctype like this:

cin.imbue(locale(cin.getloc(), new ctype<char>(data(bar))));

You can also switch between ctypes and the behavior will change mid-stream:

cin.imbue(locale(cin.getloc(), new ctype<char>(foo)));

If you need to go back to default behavior, just do this:

cin.imbue(locale(cin.getloc(), new ctype<char>));


Jonathan Mee
  • that will set `bar['\t']` to zero, probably not intended. To clear a bit, use `&~` (bit-wise AND with bit-wise NOT). `!` is logical NOT and won't have the desired effect. – Ben Voigt Jan 29 '15 at 02:02
  • @BenVoigt Thank you, I wanted to strip out the `space` and `cntrl` bits and I accidentally got everything. – Jonathan Mee Jan 29 '15 at 03:15
5

This is an improvement on Jon's answer, and the example from cppreference.com. So this follows the same premise as both, but combines them with parameterized delimiters.

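#include <cstddef>
#include <locale>
#include <string>
#include <vector>

struct delimiter_ctype : std::ctype<char> {
    static const mask* make_table(const std::string& delims)
    {
        // make a copy of the "C" locale table; it is static so that it
        // outlives the facet, which means all instances share this one table
        static std::vector<mask> v(classic_table(), classic_table() + table_size);
        // iterate by reference, otherwise only a copy of each mask is changed
        for(mask& m : v){
            m &= ~space;
        }
        // index through unsigned char so chars above 127 cannot go negative
        for(char d : delims){
            v[static_cast<unsigned char>(d)] |= space;
        }
        return &v[0];
    }
    delimiter_ctype(const std::string& delims, std::size_t refs = 0) : ctype(make_table(delims), false, refs) {}
};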

Cheers!

Josh C