Removing duplicate elements of a vector of pairs based on individual elements inside the pair

Question

So I need to find some most common used words in a file.

I have a vector<pair<string, int> > wordList which keeps track of every word in the file along with its frequency. This part works fine.

The problem is, the output shows multiple versions of each word. This is because the way I calculated it was:

load all words into the vector with a frequency of 1
Go through again and if the word shows up twice, increase its count
The part that I need help on, which is to delete multiple entries of the same word.
```
for(int j = 0; j < wordList.size(); j++)
{
```

This is my current approach. This function tallys up all the word. the problem is that the line wordList.erase that's inside the for loop produces an out of bounds error, so I cant remove the duplicate entry that way. I also tried the unique() approach, but that doesn't seem to work it only removes some entries.

What is the most efficient way to reduce a vector of pairs to only unique elements?

Yes, however I need this list printed in descending order and if I recall correctly, a map cannot be sorted based on certain values inside it. Am I wrong? — , Sep 26 '16 at 21:54
when you erase elements from a vector inside a loop you have to consider that the size is changing. If you erase element i then the next element is not i+1 but i. — 463035818_is_not_an_ai, Sep 26 '16 at 21:56
So what should I change in my loop? Doesn't referencing wordList.size() automatically account for size change? — , Sep 26 '16 at 21:58
A map is sorted by key, you're right. I don't know what other requirements you have, but I would consider constructing the map , then transcribing it into a map for printout. — Beta, Sep 26 '16 at 21:59
@ConnorSchwinghammer - *Yes, however I need this list printed in descending order* -- `std::map>`. — PaulMcKenzie, Sep 26 '16 at 22:00
How exactly would a map help resolve the duplicate issue? Is there a way to efficiently remove duplicate values? — , Sep 26 '16 at 22:01
A `std::map` does not store duplicate keys. The words are the `key` element in the map. It's as simple as `mymap["the_word"]++;` adds 1 to the number of times "the_word" is detected. It doesn't add another entry into the map if there is one already with the key `"the_word"`. — PaulMcKenzie, Sep 26 '16 at 22:02
@PaulMcKenzie Oh wow, I just implemented that and it worked perfectly. Thank you — , Sep 26 '16 at 22:49

score 0 · Answer 1 · answered Sep 26 '16 at 22:00

Your having problems because you are removing from the vector whilst you are iterating over the vector, this alters the sizes of the lists and your i++ and j++ can jump over entries and you'll miss some

You might want to consider using an std::set or performing a find() before adding it to the vector to determine if the vector already contains the word

score 0 · Answer 2 · edited May 23 '17 at 12:00

Try this:

for(int j = 0; j < wordList.size(); j++) {
    for(int k = j+1; k < wordList.size(); /*no increment*/) {
        if(wordList[j].first == wordList[k].first)
        {
            wordList[j].second++;
            wordList.erase(wordList.begin()+k);
        } else {
            k++;   // increment only if no element was erased !
        }
    }
}

When you erase inside a loop you have to consider that after erasing element k, the next one is k, not k+1, ie you have to increment only if no element was erased. Without knowing the input it is hard to say why the out of bounds error appeared, but that was the cause.

Also you dont have to check each pair twice. The second loop can start at j+1.

PS: as it was mentioned in comments to your question I would also suggest you to use a std::map instead. Even if you need a vector afterwards (see e.g here).

score 0 · Accepted Answer · answered Sep 26 '16 at 22:14

You may use something like:

std::map<std::string, std::size_t>
compute_frequency(const std::vector<std::string>& words)
{
    std::map<std::string, std::size_t> res;

    for (const auto& word : words) {
        ++res[word];
    }
    return res;
}


void test(const std::vector<std::string>& words)
{
    const auto m = compute_frequency();
    std::vector<std::pair<std::string, std::size_t>> v(m.begin(), m.end());

    auto myless = [](const auto& lhs, const auto& rhs) {
        //return lhs.first > rhs.first;   // by decreasing word
                                          // (then you may add the comp in map directly)
        return lhs.second > rhs.second;   // by decreasing frequency
    };
    std::sort(v.begin(), v.end(), myless);
    for (const auto& p : v) {
        std::cout << p.first << " appears " << p.second << std::endl;
    }

);

Thank you, I did use a map and I got it to work. Of course I've now run into 4 other problems in other parts of the program, but I do have a concise list of words and their frequencies — , Sep 26 '16 at 22:36

Removing duplicate elements of a vector of pairs based on individual elements inside the pair

3 Answers3