I want to read a text word by word, skipping any non-alphanumeric characters, in a simple way. Having first handled text that contains only spaces and '\n', I now need to handle text that also contains characters such as ',' and '.'. The first case was solved simply by using getline with the delimiter ' '. I wondered whether there is a way to use getline with multiple delimiters, or even with some kind of regular expression (for example '.'|' '|','|'\n').

As far as I know, getline reads characters from the input stream until it reaches the delimiter character, which is '\n' by default. My first guess was that it would be quite simple to give it multiple delimiters, but I found out that it is not.

Edit: just as a clarification, I'm not looking for a C-style solution (strtok, for example, which is in my opinion very ugly) or an algorithmic one. It is fairly easy to come up with a simple algorithm that solves the problem and to implement it. I'm looking for a more elegant solution, or at least an explanation of why we can't handle this with the getline function, since, unless I have completely misunderstood, it should somehow be able to accept more than one delimiter.

Eliran Abdoo
  • @GabeNones Eh, we can't keep yelling at people for tagging both C and C++ and then close this C++ question as a dupe of that C question. We should find a C++ dupe. – Baum mit Augen Dec 07 '16 at 23:06
  • @BaummitAugen: Finding a C++ dupe would be all right, but the one you've closed it against isn't a particularly good dupe (at least IMO). One answer doesn't address this problem at all (it only deals with splitting a string, not reading from a stream as required here). The other does happen to work, but only sort of by coincidence (this does specify that `\n` should be a delimiter, but it won't work for anybody else who doesn't want that). – Jerry Coffin Dec 07 '16 at 23:24
  • @JerryCoffin The question seems to be the same though. If the other question needs better answers, one can still add one, it's not closed. – Baum mit Augen Dec 07 '16 at 23:27
  • @JerryCoffin If you can find a better dupe though, change it by all means. :) – Baum mit Augen Dec 07 '16 at 23:28
  • @BaummitAugen: I disagree. The other one only talks about the source being "some text", which could be a text file, or a string. He does show reading from a stream in the question, but it's not clear whether this is really required, or just an example of one possible source. This question is quite specific in asking about reading from a stream. – Jerry Coffin Dec 08 '16 at 00:20
  • @BaummitAugen: If I knew of a question this duplicated, I'd have already done that. I haven't found a precise duplicate (though many are at least somewhat similar). – Jerry Coffin Dec 08 '16 at 00:24
  • @JerryCoffin If you disagree with my vote that much, just reopen the question. I won't go on a revenge downvote spree, promised. ;) – Baum mit Augen Dec 08 '16 at 11:50

1 Answer

There's good news and bad news. The good news is that you can do this.

The bad news is that doing it is fairly roundabout, and some people find it downright ugly and nasty.

To do it, you start by observing two facts:

  1. The normal string extractor uses whitespace to delimit "words".
  2. What constitutes white space is defined in the stream's locale.

Putting those together, the answer becomes fairly obvious (if circuitous): to define multiple delimiters, we define a locale that allows us to specify what characters should be treated as delimiters (i.e., white space):

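#include <iostream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// ctype facet that marks the characters in `delims` as white space.
struct word_reader : std::ctype<char> {
    word_reader(std::string const &delims) : std::ctype<char>(get_table(delims)) {}
    static std::ctype_base::mask const* get_table(std::string const &delims) {
        static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());

        for (char ch : delims)
            // cast so a negative char cannot index out of bounds
            rc[static_cast<unsigned char>(ch)] = std::ctype_base::space;
        return &rc[0];
    }
};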

Then we need to tell the stream to use that locale (well, a locale with that ctype facet), passing the characters we want used as delimiters, and then extract words from the stream:

int main() {
    std::istringstream in("word1, word2. word3,word4");

    // create a ctype facet specifying delimiters, and tell stream to use it:
    in.imbue(std::locale(std::locale(), new word_reader(" ,.\n")));
    std::string word;

    // read words from the stream. Note we just use `>>`, not `std::getline`:
    while (in >> word)
        std::cout << word << "\n";
}

The result is what (I hope) you want: extracting each word without the punctuation we said was "white space".

word1
word2
word3
word4
Jerry Coffin
  • Well, that is indeed a solid solution, but as you mentioned pretty trivial, and it has some 'cheating' essence to it (replacing our required delimiters by white spaces). I wondered if there's a more elegant solution, which, let's say, takes exactly N operations, where N is the file length, just like `getline` manages to perform in case our delimiter world is narrowed to whitespaces and \n. – Eliran Abdoo Dec 07 '16 at 23:30
  • @GoldenSpecOps: We're not replacing anything. The stream is looking for the end of a word. It gets a character. Asks the locale: "is this whitespace"? Continues adding characters to the word until it reaches end of file, or the locale says: "yes, that's white space". Then it skips forward for as long as the locale keeps saying the next character is white space. Lather, Rinse, repeat. – Jerry Coffin Dec 08 '16 at 00:08
  • The only major difference from getline is that if you have something like `a\n\n\nz`, `getline` will read `a`, empty line, empty line, `z`, but `>>` will read it as just `a`, `z`. – Jerry Coffin Dec 08 '16 at 00:11
  • Can you please further explain your solution? - I did understand that imbuing a stringstream with a locale, results in a stream that has a customized output (in this case, avoiding each of the relevant characters). - I also understood the usage of the `std::locale` constructor which is the following `template< class Facet > locale( const locale& other, Facet* f );` But I did not manage to fully understand the word_reader struct, and I struggled to find relevant documentation about the requirements of the locale's ctor from the template Facet class. – Eliran Abdoo Dec 08 '16 at 09:46
  • This is a damn elegant solution, if not abusive of the `locale` class, however wouldn't this remove the standard locale settings? – Varad Mahashabde Sep 05 '19 at 19:41
  • @VaradMahashabde: it only affects the locale for that stream, and it uses a default constructed locale, with just the `ctype` facet replaced. So, it only affects how that stream classifies characters. – Jerry Coffin Sep 05 '19 at 19:49