3

I want to sort a huge array, say 10^8 entries of type X with at most N different keys, where N is ~10^2. Because I don't know the range or spacing of the elements, count sort is not an option. So my best guess so far is to use a hash map for the counts like so

std::unordered_map< X, unsigned > counts;
for (auto x : input)
    counts[x]++;

This works ok-ish and is ~4 times faster than 3-way quicksort, but I'm a nervous person and it's still not fast enough.

I wonder: am I missing something? Can I make better use of the fact that N is known in advance? Or is it possible to tune the hash map to my needs?


EDIT An additional pre-condition is that the input sequence is far from sorted and all keys occur with roughly the same frequency.

stgatilov
  • 5,333
  • 31
  • 54
user1978011
  • 3,419
  • 25
  • 38
  • Maybe call [reserve(N)](http://www.cplusplus.com/reference/unordered_map/unordered_map/reserve/) before starting to add to the hash table? – IVlad Jul 21 '15 at 10:35
  • Are the keys consecutive? Do you know only their number in advance, or also their values? – dlask Jul 21 '15 at 10:37
  • @IVlad I have tried that. Makes virtually no difference for my map implementation, however. – user1978011 Jul 21 '15 at 12:52
  • @dlask I only know the number of different keys. The numbers may be non-consecutive and the key values are not known. – user1978011 Jul 21 '15 at 12:53
  • `for (auto const& x : input)` would avoid copying; it should be faster for you – coincoin Jul 21 '15 at 13:26
  • Do you know anything at all about the elements (other than there are ~ 100 distinct ones)? – Rob Jul 21 '15 at 13:45
  • Note that your code does not *sort* the array, it only counts *frequencies* of the elements. I guess it is enough for you. – stgatilov Jul 21 '15 at 13:47
  • You can try a hash table more optimized than `std::unordered_map`, such as Google's [`dense_hash_map`](http://sparsehash.googlecode.com/svn/trunk/doc/dense_hash_map.html). It also sometimes helps to use a better hash function than `std::hash`. – Ilya Popov Jul 21 '15 at 15:04

3 Answers

2

STL implementations are often not perfect in terms of performance (no holy wars, please).

If you know a guaranteed and sensible upper bound on the number of unique elements (N), then you can trivially implement your own hash table of size 2^s >> N. Here is how I usually do it myself:

int size = 1;
while (size < 3 * N) size <<= 1;
//Note: at least 3X size factor, size = power of two
//count = -1 means empty entry
std::vector<std::pair<X, int>> table(size, std::make_pair(X(), -1));
auto GetHash = [size](X val) -> int { return std::hash<X>()(val) & (size-1); };

for (auto x : input) {
  int cell = GetHash(x);
  bool ok = false;
  for (; table[cell].second >= 0; cell = (cell + 1) & (size-1)) {
    if (table[cell].first == x) { //match found -> stop
      ok = true;
      break;
    }
  }
  if (!ok) {             //match not found -> add entry on free place
    table[cell].first = x;
    table[cell].second = 0;
  }
  table[cell].second++;  //increment counter
}

On MSVC2013, it improves time from 0.62 secs to 0.52 secs compared to your code, given that int is used as type X.

Also, we can choose a faster hash function. Note, however, that the choice of hash function depends heavily on the properties of the input. Let's take Knuth's multiplicative hash:

auto GetHash = [size](X val) -> int { return (val*2654435761) & (size-1); };

It further improves time to 0.34 secs.

As a conclusion: do you really want to reimplement standard data structures to achieve a 2X speed boost?

Notes: Speedup may be entirely different on another compiler/machine. You may have to do some hacks if your type X is not POD.

stgatilov
  • 5,333
  • 31
  • 54
2

Counting sort really would be best, but isn't applicable due to the unknown range or spacing.

The counting pass seems easy to parallelize with fork-join, e.g. with boost::thread.
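
A minimal sketch of that fork-join idea, using std::thread for illustration (boost::thread would look much the same). The chunking scheme and the name parallel_counts are assumptions made for the sketch, not something this answer prescribes; it also assumes std::hash<X> exists, as the question's own code already requires.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <unordered_map>
#include <vector>

// Sketch: each thread counts its own chunk into a private map,
// then the small per-thread maps (~N keys each) are merged serially.
template <class X>
std::unordered_map<X, unsigned> parallel_counts(const std::vector<X>& input,
                                                unsigned num_threads)
{
    std::vector<std::unordered_map<X, unsigned>> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = input.size() / num_threads + 1;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(input.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++partial[t][input[i]];
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_map<X, unsigned> total;  // merge is cheap: only ~N keys per map
    for (const auto& p : partial)
        for (const auto& kv : p)
            total[kv.first] += kv.second;
    return total;
}

The merge step touches only about N * num_threads entries, so nearly all the work stays in the parallel counting loop.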

You could also try a more efficient, hand-rolled hash map. `std::unordered_map` typically uses linked lists (separate chaining) to cope with potentially bad hash functions. The memory overhead of those linked lists may hurt performance if the hash table doesn't fit into the L1 cache. Closed hashing may use less memory. Some hints for optimizing:

  • closed hashing with linear probing and without support for removal
  • a power-of-two-sized hash table so the index can be computed with a bit mask instead of modulo (division takes multiple cycles and there is typically only one hardware divider per core); see the sketch after this list
  • a low load factor (entries divided by table size) to minimize collisions. That's a trade-off between memory usage and the number of collisions; a load factor over 0.5 should be avoided. A table size of 256 seems suitable for 100 entries.
  • a cheap hash function. You haven't shown the type of X, so perhaps a cheaper hash function could outweigh a few more collisions.
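
A minimal sketch of the sizing and masking hints above; the function names are illustrative only:

#include <cstddef>

// Pick the smallest power-of-two table size that keeps the load factor
// at or below 0.5, then index with a bit mask instead of '%'.
std::size_t table_size_for(std::size_t max_entries) {
    std::size_t size = 1;
    while (2 * max_entries > size)  // keep load factor <= 0.5
        size <<= 1;
    return size;                    // e.g. 100 entries -> 256
}

std::size_t slot_for(std::size_t hash, std::size_t size) {
    return hash & (size - 1);       // valid because size is a power of two
}
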
Markus Kull
  • 1,471
  • 13
  • 16
0

I would look at storing the items in a sorted vector: with only about 100 distinct keys, an insertion into the vector would occur for only about 1 in 10^6 entries. Lookup would be a processor-efficient binary search in the vector.
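
A minimal sketch of this approach; count_sorted is an illustrative name, and it assumes X supports both < and ==:

#include <algorithm>
#include <utility>
#include <vector>

// Keys are kept sorted, so each lookup is a binary search; an actual
// insertion happens only when a new key appears (~100 times in total).
template <class X>
std::vector<std::pair<X, unsigned>> count_sorted(const std::vector<X>& input)
{
    std::vector<std::pair<X, unsigned>> counts;   // sorted by key
    for (const auto& x : input) {
        auto it = std::lower_bound(counts.begin(), counts.end(), x,
            [](const std::pair<X, unsigned>& p, const X& key) {
                return p.first < key;
            });
        if (it != counts.end() && it->first == x)
            ++it->second;                         // existing key
        else
            counts.insert(it, {x, 1u});           // rare: new key
    }
    return counts;
}

After the loop, counts is already ordered by key, so writing each key out counts[i].second times yields the sorted array.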

mksteve
  • 12,614
  • 3
  • 28
  • 50