Count unique words in a string in C++

Question

I want to count how many unique words are in string 's' where punctuations and newline character (\n) separates each word. So far I've used the logical or operator to check how many wordSeparators are in the string, and added 1 to the result to get the number of words in string s.

My current code returns 12 as the number of word. Since 'ab', 'AB', 'aB', 'Ab' (and same for 'zzzz') are all same and not unique, how can I ignore the variants of a word? I followed the link: http://www.cplusplus.com/reference/algorithm/unique/, but the reference counts unique item in a vector. But, I am using string and not vector.

Here is my code:

#include <iostream>
#include <string>
using namespace std;

bool isWordSeparator(char & c) {

    return c == ' ' || c == '-' || c == '\n' || c == '?' || c == '.' || c == ','
    || c == '?' || c == '!' || c == ':' || c == ';';
}

int countWords(string s) {
    int wordCount = 0;

    if (s.empty()) {
    return 0;
    }

    for (int x = 0; x < s.length(); x++) {
    if (isWordSeparator(s.at(x))) {
            wordCount++;

    return wordCount+1;

int main() {
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";
    int number_of_words = countWords(s);

    cout << "Number of Words: " << number_of_words  << endl;

    return 0;

}

*But, I am using string and not vector.* -- The `std::unique` works on any sequence container, including `std::string`. — PaulMcKenzie, Feb 26 '17 at 05:06
You could tokenize the string first, see http://stackoverflow.com/questions/53849/how-do-i-tokenize-a-string-in-c then you need to lowercase all the words and add them to a set, to remove duplicates, and then return the size of the set. Give that a try and let us know if you got stuck in anything. — Hesham Attia, Feb 26 '17 at 05:06
Also [this is a better link](http://en.cppreference.com/w/cpp/algorithm/unique) — PaulMcKenzie, Feb 26 '17 at 05:10
Voted to close as too broad. I can imagine more than ten approaches. But it's good stuff for book treatment. Even Knuth wrote about PAMELA, I think it was, in his The Art of Computer Programming. — Cheers and hth. - Alf, Feb 26 '17 at 05:23
Note about `std::unique` it works on consecutive duplicates, so aa aa bb aa will treat the third aa as unique. Sort the list first or use a `std::set`. — user4581301, Feb 26 '17 at 06:13

Kelm · Answer 1 · 2017-02-26T05:19:55.613

3

You could create a set of strings, save the position of the last separator (starting with 0) and use substring to extract the word, then insert it into the set. When done just return the set's size.

You could make the whole operation easier by using string::split - it tokenizes the string for you. All you have to do is insert all of the elements in the returned array to the set and again return it's size.

Edit: as per comments, you need a custom comparator to ignore case for comparisons.

edited Feb 26 '17 at 05:19

answered Feb 26 '17 at 05:07

Kelm

967
5
14

2

The `std::set` has to know about case insensitive strings. That requires a secondary parameter, i.e. `std::set`. – PaulMcKenzie Feb 26 '17 at 05:09

Ap31 · Accepted Answer · 2017-02-26T06:34:22.267

3

What you need to make your code case-insensitive is tolower().
You can apply it to your original string using std::transform:

std::transform(s.begin(), s.end(), s.begin(), ::tolower);

I should add however that your current code is much closer to C than to C++, perhaps you should check out what standard library has to offer.

I suggest istringstream + istream_iterator for tokenizing and either unique_copy or set for getting rid of the duplicates, like this: https://ideone.com/nb4BEH

edited Feb 26 '17 at 06:34

answered Feb 26 '17 at 05:13

Ap31

3,244
1
18
25

1

You suggested using `istringstream` and `istream_iterator` for tokenizing. Since I am a newbie in C++, Can you share any online resource from which I can learn it easily? – Naz Islam Feb 26 '17 at 05:48
2

@Naz-AlIslam well, [cppreference.com](http://en.cppreference.com/w/Main_Page) comes to mind as a source of reference, and [here is the question](http://stackoverflow.com/questions/1178430/what-is-a-good-book-to-learn-stl) that suggests some books. I'm not sure about "easily" though:) – Ap31 Feb 26 '17 at 06:02
1

For now, I've added the link to working example to my answer. Notice that I'm using a lambda to combine your `isWordSeparator` with `tolower` - if you only considered whitespace to be the separator, there would be no need for that. – Ap31 Feb 26 '17 at 06:37

zett42 · Answer 3 · 2017-02-26T13:30:52.483

While splitting the string into words, insert all words into a std::set. This will get rid of the duplicates. Then it's just a matter of calling set::size() to get the number of unique words.

I'm using the boost::split() function from the boost string algorithm library in my solution, because is almost standard nowadays. Explanations in the comments in code...

#include <iostream>
#include <string>
#include <set>
#include <boost/algorithm/string.hpp>
using namespace std;

// Function suggested by user 'mshrbkv':
bool isWordSeparator(char c) {
    return std::isspace(c) || std::ispunct(c);
}

// This is used to make the set case-insensitive.
// Alternatively you could call boost::to_lower() to make the
// string all lowercase before calling boost::split(). 
struct IgnoreCaseCompare { 
    bool operator()( const std::string& a, const std::string& b ) const {
        return boost::ilexicographical_compare( a, b );
    }
};

int main()
{
    string s = "ab\nAb!aB?AB:ab.AB;ab\nAB\nZZZZ zzzz Zzzz\nzzzz";

    // Define a set that will contain only unique strings, ignoring case.
    set< string, IgnoreCaseCompare > words;

    // Split the string by using your isWordSeparator function
    // to define the delimiters. token_compress_on collapses multiple
    // consecutive delimiters into only one. 
    boost::split( words, s, isWordSeparator, boost::token_compress_on );

    // Now the set contains only the unique words.
    cout << "Number of Words: " << words.size() << endl;
    for( auto& w : words )
        cout << w << endl;

    return 0;
}

Demo: http://coliru.stacked-crooked.com/a/a3b51a6c6a3b4ee8

Thanks @zett42 . But when I compile it using the g++ compiler, it shows error: `fatal error: 'boost/algorithm/string.hpp' file not found`. How can I solve it? — Naz Islam, Feb 26 '17 at 06:11
Download boost source from http://www.boost.org/ , extract it to some folder and add this folder to the "include" path of g++. (The string library is a header-only library so you don't have to build boost!) — zett42, Feb 26 '17 at 13:26
Plugging in boost for such an easy tasks looks like an overkill to me. String splitting and case-insensitive comparison can be easily done without any third-party libraries. — mshrbkv, Feb 28 '17 at 11:49

mshrbkv · Answer 4 · 2017-02-26T06:27:32.917

First of all I'd suggest rewriting isWordSeparator like this:

bool isWordSeparator(char c) {
    return std::isspace(c) || std::ispunct(c);
}

since your current implementation doesn't handle all the punctuation and space, like \t or +.

Also, incrementing wordCount when isWordSeparator is true is incorrect for example if you have something like ?!.

So, a less error-prone approach would be to substitute all separators by space and then iterate words inserting them into an (unordered) set:

#include <iterator>
#include <unordered_set>
#include <algorithm>
#include <cctype>
#include <sstream>

int countWords(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](char c) { 
        if (isWordSeparator(c)) {
            return ' ';
        }

        return std::tolower(c);
    });

    std::unordered_set<std::string> uniqWords;

    std::stringstream ss(s);
    std::copy(std::istream_iterator<std::string>(ss), std::istream_iterator<std::string(), std::inserter(uniqWords));

    return uniqWords.size();
}

score 0 · Answer 5 · edited Mar 01 '17 at 06:57

0

You can consider SQLite c++ wrapper

edited Mar 01 '17 at 06:57

Ria

10,237
3
33
60

answered Feb 28 '17 at 00:54

Amit Vujic

1,632
1
24
32

Count unique words in a string in C++

5 Answers5