0

I have a map in the following style :

mapa["01"]=2
mapa["111"]=3

that shows how frequently does each sub-string appear in a text (the text consists of 1s and 0s).

Now let's say that I want to output the 10 most frequently used sub-strings, I would need to go through all the keys in the map and get the ones with the biggest values (sorted by the value, descending). Any pseudo code or commands that I should use?

If it's of any use to you, I am working on this task and I got the all the values, I just don't know how to sort them.

A. Andevski
  • 437
  • 1
  • 5
  • 17
  • 1
    If this is the _only_ thing you wish to do with the data, you are storing it wrong. :) – Lightness Races in Orbit Apr 22 '14 at 11:00
  • @LightnessRacesinOrbit care to suggest any other ideas? Keep in mind that I'm still a beginner here, solving tasks that are for high school competitions. Also no additional libraries can be used. – A. Andevski Apr 22 '14 at 11:08
  • @LightnessRacesinOrbit I don't agree. This basically the common map-reduce solution to most frequent elements. `map(text)->(token,'1'), reduce(token,list)->(token,sum(list)`, and then choosing top k. Only noteable difference is usage of tree-based map rather than hashing solution (which is how MR is implemented) – amit Apr 22 '14 at 11:08
  • @amit: If this is the only lookup he wants to perform, then his keys and values are quite clearly the wrong way around. If not, then there's no problem. – Lightness Races in Orbit Apr 22 '14 at 11:18
  • 1
    @LightnessRacesinOrbit The map is his histogram created in previous step. He had to count the #occurances for each substring, and this is how it is typically done. – amit Apr 22 '14 at 11:19
  • @amit: You're not listening to me, but that's fine. – Lightness Races in Orbit Apr 22 '14 at 11:25

2 Answers2

1

You basically need to implement some top-k filtering algorithm using the comparator

x < y if and only if mapa[x] < mapa[y]

Choosing the top k out of a list efficiently is covered in the thread Store the largest 5000 numbers from a stream of numbers:

The basic idea is to create a min heap which always stored the top k elements encountered so far. When you are done, the heap contains the top K elements in the stream.

An alternative is using selection algorithm to find the 10th biggest element, and produce all the elements greater than it (a bit more work if there are duplicates).

Community
  • 1
  • 1
amit
  • 175,853
  • 27
  • 231
  • 333
0

See this duplicate , you basically iterate the map and use std::partial_sort on the values.

typedef std::pair<std::string, int> mypair;

struct IntCmp {
    bool operator()(const mypair &lhs, const mypair &rhs) {
        return lhs.second > rhs.second;
    }
};


void print10(const std::map<std::string,int> &mymap) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= 10);
    std::partial_sort(myvec.begin(), myvec.begin() + 10, myvec.end(), IntCmp());

    for (int i = 0; i < 10; ++i) {
        std::cout << i << ": " << myvec[i].first 
            << "-> " << myvec[i].second << "\n";
    }
}
Community
  • 1
  • 1
quantdev
  • 23,517
  • 5
  • 55
  • 88
  • Works, but amit's solution is better (No need to copy map entries which are definitely not in the top 10) – MSalters Apr 22 '14 at 13:26