Splitting a String in C++ (using cin)

Question

I'm working on this UVa problem, which takes the following input:

This is fun-
ny!  Mr.P and I've never seen
this ice-cream flavour
before.Crazy eh?
#
This is fun-
ny!  Mr.P and I've never seen
this ice-cream flavour
before.Crazy eh?
#

and produces this output:

In the input, # divides the cases. I'm supposed to get the length of each word and count the frequency of each different length (as you see in the output, a word of length 1 occurs once, length 2 occurs three times, 3 occurs twice, and so on).

My problem is this: when reading from cin, before.Crazy is counted as one word, since no space separates the two words. It should then be as simple as splitting the string on certain punctuation ({".",",","!","?"} for example)... but C++ seems to have no simple way to split a string.

So, my question: How can I split the string and send in each returned string to my function that handles the rest of the problem?

Here's my code:

int main()
{
    string input="";
    while(cin.peek()!=-1)
    {   
        while(cin >> input && input!="#")
        {
            lengthFrequency(input);
            cout << input << " " << input.length() << endl;
        }

        if(cin.peek()!=-1) cout << endl;
        lengthFrequencies.clear();
    }
    return 0;
}

lengthFrequencies is a map<int,int>.

I don't see any words of length one in the example input.. – David G Oct 25 '13 at 01:15

score 4 · Answer 1 · answered Oct 25 '13 at 01:26

You can redefine what a stream considers to be a whitespace character by using a std::locale with a custom std::ctype<char> facet. Here is the corresponding code, which doesn't quite do the assignment but demonstrates how to use the facet:

#include <algorithm>
#include <iostream>
#include <locale>
#include <string>

struct ctype
    : std::ctype<char>
{
    typedef std::ctype<char> base;
    static base::mask const* make_table(char const* spaces,
                                        base::mask* table)
    {
        base::mask const* classic(base::classic_table());
        std::copy(classic, classic + base::table_size, table);
        for (; *spaces; ++spaces) {
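            // index through unsigned char so a negative char cannot index out of bounds
            table[static_cast<unsigned char>(*spaces)] |= base::space;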
        }
        return table;
    }
    ctype(char const* spaces)
        : base(make_table(spaces, table))
    {
    }
    base::mask table[base::table_size];
};

int main()
{
    std::cin.imbue(std::locale(std::locale(), new ctype(".,!?")));
    for (std::string s; std::cin >> s; ) {
        std::cout << "s='" << s << "'\n";
    }
}

Dietmar Kühl

What are you? A wizard?! That's awesome! – jrd1 Oct 25 '13 at 01:35
What about the hyphens and `#`? – David G Oct 25 '13 at 01:37
@jrd1: At some point I have implemented locales and IOStreams for my experimental standard C++ library implementation [CXXRT](http://www.dietmar-kuehl.de/cxxrt/). As a result I know what classes there are and how to use them to customize IOStreams. – Dietmar Kühl Oct 25 '13 at 01:37
@DietmarKühl: Good stuff. Thanks for explanation and link to your library! :) – jrd1 Oct 25 '13 at 01:39
@0x499602D2: I excluded the characters listed in the example: `({".",",","!","?"} for example)`. You can use different characters when constructing the custom `ctype` facet, e.g. `new ctype("#.,!?-")`. – Dietmar Kühl Oct 25 '13 at 01:40
@jrd1: note that the library has been rather unmaintained for about a decade. I hope to eventually get around to posting an update but so far I haven't found the time... – Dietmar Kühl Oct 25 '13 at 01:41
This is a little over my head since I'm not very fluent in C++ yet, but I'll give it a shot. Thanks for the thorough and impressive answer! – muttley91 Oct 25 '13 at 01:44
@DietmarKühl: That's still awesome though. :) Thanks for sharing! – jrd1 Oct 25 '13 at 01:45

score 0 · Answer 2 · answered Oct 25 '13 at 01:32

Before counting the frequencies, you could parse the input string and replace all the {".",",","!","?"} characters with spaces (or whatever separation character you want to use). Then your existing code should work.

You may want to handle some characters differently. For example, in the case of before.Crazy you would replace the '.' with a space, but for something like 'ny! ' you would remove the '!' altogether because it is already followed by a space.

Andrew Crawford

This would probably work actually. Just the complexity of differentiating between which to put a space and not, but not too bad. Thanks. – muttley91 Oct 25 '13 at 01:44

score 0 · Answer 3 · edited May 23 '17 at 12:29

How about this (using the STL, comparators and functors)?

NOTE: All assumptions and explanations are in the source code itself.

#include <iostream>
#include <string>
#include <vector>
#include <cstdlib>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <utility>
#include <string.h>

bool compare (const std::pair<int, int>& l, const std::pair<int, int>& r) {
    return l.first < r.first;
}

//functor/unary predicate:
struct CompareFirst {
    CompareFirst(int val) : val_(val) {}
    bool operator()(const std::pair<int, int>& p) const {
        return (val_ == p.first);
    }
private:
    int val_;
};

int main() {
    char delims[] = ".,!?";
    char noise[] ="-'";

    //I'm assuming you've read the text from some file, and that information has been stored in a string. Or, the information is a string (like below):
    std::string input = "This is fun-\nny,  Mr.P and I've never seen\nthis ice-cream flavour\nbefore.Crazy eh?\n#\nThis is fun-\nny!  Mr.P and I've never seen\nthis ice-cream flavour\nbefore.Crazy eh?\n#\n";

    std::istringstream iss(input);
    std::string temp;

    //first split the string by #
    while(std::getline(iss, temp, '#')) {

        //find every hyphen that ends a line and remove the newline that follows it:
        std::string::size_type begin = 0;

        while(std::string::npos != (begin = temp.find('-', begin))) {
            //look at the character after the current hyphen and erase it if it's a newline
            if (begin + 1 < temp.size() && temp[begin+1] == '\n') {
                temp.erase(begin+1, 1);
            }
            ++begin;
        }

        //now erase all the `noise` characters ("-'"), since this punctuation adds nothing to a word's length
        for (int i = 0; i < strlen(noise); ++i) {
            //this removes every hyphen and apostrophe from the text
            temp.erase(std::remove(temp.begin(), temp.end(), noise[i]), temp.end());
        }//at this point, the text is ready for the delimiters to be replaced

        //now remove the remaining delimiter characters by replacing them with spaces
        for (int i = 0; i < strlen(delims); ++i) {
            std::replace(temp.begin(), temp.end(), delims[i], ' ');
        }

        std::vector<std::pair<int, int> > occurences;

        //initialize another input stringstream to make use of the whitespace
        std::istringstream ss(temp);

        //now use the whitespace to tokenize
        while (ss >> temp) {

            //try to find the token's size in the occurences
            std::vector<std::pair<int, int> >::iterator it = std::find_if(occurences.begin(), occurences.end(), CompareFirst(temp.size()));

            //if found, increment count by 1
            if (it != occurences.end()) {
                it->second += 1;//increment the count
            }
            //this is the first time this length has been seen: store it with a count of 1
            else {
                occurences.push_back(std::make_pair<int, int>(temp.size(), 1));
            }
        }

        //now sort and output:
        std::stable_sort(occurences.begin(), occurences.end(), compare);

        for (int i = 0; i < occurences.size(); ++i) {
            std::cout << occurences[i].first << " " << occurences[i].second << "\n";
        }
        std::cout << "\n";
    }

    return 0;
}

91 lines, and all vanilla C++98.

A rough outline of what I did is:

1. Since hyphenated words can be broken across two lines, find all hyphens and remove any newline that follows them.
2. Some characters, such as the hyphen in legitimately hyphenated words and the apostrophe, don't add to the length of a word. Find these and erase them, as it makes tokenizing easier.
3. All the remaining delimiters can now be found and replaced with whitespace. Why? Because we can use the whitespace to our advantage by using streams (whose default action is to skip whitespace).
4. Create a stream and tokenize the text via whitespace, as per the previous step.
5. Store the lengths of the tokens and their occurrences.
6. Sort the lengths of the tokens, and then output each token length and its corresponding number of occurrences.

REFERENCES:

https://stackoverflow.com/a/5815875/866930

https://stackoverflow.com/a/12008126/866930
