2

I am trying to determine the most frequent character in a vector that has chars as its elements.

I am thinking of doing this:

  • looping through the vector and creating a map, where a key would be a unique char found in the vector. The corresponding value would be the integer count of the frequency of that char.
  • After I have gone through all elements in the vector, the map will contain all character frequencies. Thus I will then have to find which key had the highest value and therefore determine the most frequent character in the vector.

This seems quite convoluted though, thus I was wondering if someone could suggest if this method would be considered 'acceptable' in terms of performance/good coding

Can this be done in a better way?

Ivaylo Strandjev
  • 69,226
  • 18
  • 123
  • 176
rrazd
  • 1,741
  • 2
  • 32
  • 47
  • What you've described is definitely the normal way to solve this. Is that what you're asking? – hcarver Aug 21 '12 at 06:26
  • How big is the maximum vector size? What characters are allowed in the vector? – LeleDumbo Aug 21 '12 at 06:26
  • Lots of answers here: http://stackoverflow.com/questions/1991984/algorithm-for-finding-the-number-which-appears-the-most-in-a-row-c – Ray Toal Aug 21 '12 at 06:26
  • 2
    Convoluted? That is the most straight-forward obvious way to do this task. It should be your first option, and if, after profiling, it is determined to be not efficient enough, then you can look at other options. – Benjamin Lindley Aug 21 '12 at 06:27
  • @Hbcdev I wasn't sure if this was the common solution as it just seemed the most intuitive solution I could come up with, I wasn't sure if it would be considered unacceptably inefficient in industry though.... – rrazd Aug 21 '12 at 06:28

2 Answers2

6

If you are only using regular ascii characters, you can make the solution a bit faster - instead of using a map, use an array of size 256 and count the occurrences of the character with a given code 'x' in the array cell count[x]. This will remove an logarithm(256) from your solution and thus will make it a bit faster. I do not think much more can be done with respect to optimization of this algorithm.

Ivaylo Strandjev
  • 69,226
  • 18
  • 123
  • 176
2

Sorting a vector of chars and then iterating through that looking for the maximum run lengths seems to be 5 times faster than using the map approach (using the fairly unscientific test code below acting on 16M chars). On the surface both functions should perform close to each other because they execute with O(N log N). However the sorting method probably benefits from branch prediction and move semantics of the in-place vector sort.

The resultant output is:

Most freq char is '\334', appears 66288 times.
usingSort() took 938 milliseconds
Most freq char is '\334', appears 66288 times.
usingMap() took 5124 milliseconds

And the code is:

#include <iostream>
#include <map>
#include <vector>
#include <chrono>

void usingMap(std::vector<char> v)
{
    std::map<char, int> m;
    for ( auto c : v )
    {
        auto it= m.find(c);
        if( it != m.end() )
            m[c]++;
        else
            m[c] = 1;
    }

    char mostFreq;
    int count = 0;
    for ( auto mi : m )
        if ( mi.second > count )
        {
            mostFreq = mi.first;
            count = mi.second;
        }

    std::cout << "Most freq char is '" << mostFreq << "', appears " << count << " times.\n";
}

void usingSort(std::vector<char> v)
{
    std::sort( v.begin(), v.end() );
    char currentChar = v[0];
    char mostChar = v[0];
    int currentCount = 0;
    int mostCount = 0;
    for ( auto c : v )
    {
        if ( c == currentChar )
            currentCount++;
        else
        {
            if ( currentCount > mostCount)
            {
                mostChar = currentChar;
                mostCount = currentCount;
            }
            currentChar = c;
            currentCount = 1;
        }
    }

    std::cout << "Most freq char is '" << mostChar << "', appears " << mostCount << " times.\n";
}

int main(int argc, const char * argv[])
{
    size_t size = 1024*1024*16;
    std::vector<char> v(size);
    for ( int i = 0; i < size; i++)
    {
        v[i] = random() % 256;
    }

    auto t1 = std::chrono::high_resolution_clock::now();
    usingSort(v);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout
        << "usingSort() took "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count()
        << " milliseconds\n";
    auto t3 = std::chrono::high_resolution_clock::now();
    usingMap(v);
    auto t4 = std::chrono::high_resolution_clock::now();
    std::cout
        << "usingMap() took "
        << std::chrono::duration_cast<std::chrono::milliseconds>(t4-t3).count()
        << " milliseconds\n";
    return 0;
}
Community
  • 1
  • 1
Phillip Ngan
  • 15,482
  • 8
  • 63
  • 79