Determining most freq char element in a vector?

Question

I am trying to determine the most frequent character in a vector that has chars as its elements.

I am thinking of doing this:

looping through the vector and creating a map, where a key would be a unique char found in the vector. The corresponding value would be the integer count of the frequency of that char.
After I have gone through all elements in the vector, the map will contain all character frequencies. Thus I will then have to find which key had the highest value and therefore determine the most frequent character in the vector.

This seems quite convoluted though, thus I was wondering if someone could suggest if this method would be considered 'acceptable' in terms of performance/good coding

Can this be done in a better way?

What you've described is definitely the normal way to solve this. Is that what you're asking? — hcarver, Aug 21 '12 at 06:26
How big is the maximum vector size? What characters are allowed in the vector? — LeleDumbo, Aug 21 '12 at 06:26
Lots of answers here: http://stackoverflow.com/questions/1991984/algorithm-for-finding-the-number-which-appears-the-most-in-a-row-c — Ray Toal, Aug 21 '12 at 06:26
Convoluted? That is the most straight-forward obvious way to do this task. It should be your first option, and if, after profiling, it is determined to be not efficient enough, then you can look at other options. — Benjamin Lindley, Aug 21 '12 at 06:27
@Hbcdev I wasn't sure if this was the common solution as it just seemed the most intuitive solution I could come up with, I wasn't sure if it would be considered unacceptably inefficient in industry though.... — rrazd, Aug 21 '12 at 06:28

score 6 · Accepted Answer · answered Aug 21 '12 at 06:34

6

If you are only using regular ascii characters, you can make the solution a bit faster - instead of using a map, use an array of size 256 and count the occurrences of the character with a given code 'x' in the array cell count[x]. This will remove an logarithm(256) from your solution and thus will make it a bit faster. I do not think much more can be done with respect to optimization of this algorithm.

answered Aug 21 '12 at 06:34

Ivaylo Strandjev

69,226
18
123
176

1

Plus this solution is cache-friendly. – Alexander Chertov Aug 21 '12 at 06:35
3

Beware: a `char` may be **signed** in which case using it as an index is... surprising. – Matthieu M. Aug 21 '12 at 07:46

score 2 · Answer 2 · edited May 23 '17 at 10:34

Sorting a vector of chars and then iterating through that looking for the maximum run lengths seems to be 5 times faster than using the map approach (using the fairly unscientific test code below acting on 16M chars). On the surface both functions should perform close to each other because they execute with O(N log N). However the sorting method probably benefits from branch prediction and move semantics of the in-place vector sort.

The resultant output is:

Most freq char is '\334', appears 66288 times.
usingSort() took 938 milliseconds
Most freq char is '\334', appears 66288 times.
usingMap() took 5124 milliseconds

And the code is:

#include <iostream>
#include <map>
#include <vector>
#include <chrono>

void usingMap(std::vector<char> v)
{
    std::map<char, int> m;
    for ( auto c : v )
    {
        auto it= m.find(c);
        if( it != m.end() )
            m[c]++;
        else
            m[c] = 1;
    }

    char mostFreq;
    int count = 0;
    for ( auto mi : m )
        if ( mi.second > count )
        {
            mostFreq = mi.first;
            count = mi.second;
        }

    std::cout << "Most freq char is '" << mostFreq << "', appears " << count << " times.\n";
}

void usingSort(std::vector<char> v)
{
    std::sort( v.begin(), v.end() );
    char currentChar = v[0];
    char mostChar = v[0];
    int currentCount = 0;
    int mostCount = 0;
    for ( auto c : v )
    {
        if ( c == currentChar )
            currentCount++;
        else
        {
            if ( currentCount > mostCount)
            {
                mostChar = currentChar;
                mostCount = currentCount;
            }
            currentChar = c;
            currentCount = 1;
        }
    }

    std::cout << "Most freq char is '" << mostChar << "', appears " << mostCount << " times.\n";
}

int main(int argc, const char * argv[])
{
    size_t size = 1024*1024*16;
    std::vector<char> v(size);
    for ( int i = 0; i < size; i++)
    {
        v[i] = random() % 256;
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    usingSort(v);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout
        << "usingSort() took "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
        << " milliseconds\n";
    auto t3 = std::chrono::high_resolution_clock::now();
    usingMap(v);
    auto t4 = std::chrono::high_resolution_clock::now();
    std::cout
        << "usingMap() took "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t4-t3).count()
        << " milliseconds\n";
    return 0;
}

My wild guess is that the sorting approach benefits from fewer heap allocations and better cache locality. — Alexander Chertov, Aug 21 '12 at 09:57
Anyway, I have a question similar to this. Would appreciate it if you take a look:http://stackoverflow.com/questions/21693841/getting-the-most-occurrence-value-in-a-vectorcvvec3b-with-c — rockinfresh, Feb 11 '14 at 09:39

Determining most freq char element in a vector?

2 Answers2

Linked