
This is more of a theoretical question. In addition, I have to admit that I did some rather sophisticated performance tests on this some time ago, but I cannot find the source code anymore.

Some words about the type of application: I am focusing on scenarios with really big numbers of entries, from 50,000 up to several million, while memory consumption does not really matter.

I fully understand the basic concept of a hash data structure, why it generally has constant access times, and why rehashing is required at some point.

Any possible key is mapped to a certain slot in the hash structure. Of course, collisions may occur, resulting in multiple keys being mapped to the same slot. This is where implementation details come into play. As far as I know, there is some kind of logic using the "next" slot if the initially assigned slot is occupied.
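
To make clear what I mean by the "next" slot: roughly, I imagine something like the following sketch of open addressing with linear probing. This is just my mental model, not necessarily what QHash actually does internally.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Rough sketch of open addressing with linear probing: every entry lives
// directly in a flat array of slots, and a colliding key moves on to the
// next slot until a free one is found.
template <typename K, typename V>
struct ProbingTable {
    struct Slot { K key{}; V value{}; bool used = false; };
    std::vector<Slot> slots = std::vector<Slot>(16);   // fixed size, just for the sketch

    bool lookup(const K& key, V& out) const {
        std::size_t i = std::hash<K>{}(key) % slots.size();
        std::size_t probes = 0;
        // Walk forward until the key or an empty slot is found
        // (or every slot has been probed, in case the table is full).
        while (slots[i].used && probes++ < slots.size()) {
            if (slots[i].key == key) { out = slots[i].value; return true; }
            i = (i + 1) % slots.size();                // the "next" slot
        }
        return false;
    }
};
```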

My feeling is that there has to be a weak spot somewhere, performance-wise. Given a really big QHash, filled up just below its capacity, from which keys are then removed randomly while new keys are added (without ever increasing the total number of stored keys, making sure it does not rehash): I would think this has to lead to severe performance degradation in the long term.

Filling up a QHash just below its capacity with random values should result in a considerable number of key collisions. Looking up a key affected by collisions requires multiple slots to be inspected, resulting in performance penalties. Removing keys and adding new random keys instead should make things even worse: contiguous sequences of colliding keys become fragmented. Collisions occupy 'foreign' slots, forcing a key actually mapped to such a slot to be stored somewhere else, even though that slot might be freed again later...

To make it short, I would expect that for the given scenario (performing deletes and inserts on a QHash which is always kept just below its capacity), performance should degrade in the long run, either because lookup times increase due to growing disorder, or because of periodic reordering.

However, I once took some effort to demonstrate this performance degradation (as I said, I cannot find that project anymore, so I stand here bare-handed, I'm afraid), and I could not find any.
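
For reference, the test was something along these lines. This is not the original code (which is lost), just a minimal sketch of the idea; sizes, key types, and iteration counts are chosen purely for illustration:

```cpp
#include <QDebug>
#include <QElapsedTimer>
#include <QHash>
#include <QVector>
#include <random>

int main()
{
    const int n = 1000000;                  // illustrative size, kept constant throughout
    QHash<quint64, int> hash;
    hash.reserve(n);                        // keep the hash "just below its capacity"

    std::mt19937_64 rng(42);
    QVector<quint64> keys;
    keys.reserve(n);
    for (int i = 0; i < n; ++i) {
        const quint64 k = rng();
        hash.insert(k, i);
        keys.append(k);
    }

    std::uniform_int_distribution<int> pick(0, n - 1);
    QElapsedTimer timer;
    for (int round = 0; round < 100; ++round) {
        // Churn: remove random keys and insert fresh random ones,
        // keeping the total size constant so no rehash is triggered.
        for (int i = 0; i < 10000; ++i) {
            const int idx = pick(rng);
            hash.remove(keys[idx]);
            const quint64 k = rng();
            hash.insert(k, i);
            keys[idx] = k;
        }
        // Time a batch of successful lookups; if lookups degraded over time,
        // the reported time per lookup should grow from round to round.
        long long checksum = 0;
        timer.start();
        for (int i = 0; i < 10000; ++i)
            checksum += hash.value(keys[pick(rng)]);
        qDebug() << "round" << round
                 << timer.nsecsElapsed() / 10000 << "ns/lookup"
                 << "(checksum" << checksum << ")";
    }
    return 0;
}
```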

Is there some special magic in how QHash handles collisions that I am not aware of?

philipp
    hash performance is all about what data you put in it. Collect your real world data and benchmark it. – xaxxon Oct 19 '16 at 08:28
  • maybe this can help you a bit: http://stackoverflow.com/questions/2196995/is-there-any-advantage-of-using-map-over-unordered-map-in-case-of-trivial-keys I guess you should look for general hash maps etc. instead of only `QHash` – Hayt Oct 19 '16 at 08:31
  • I don't think you really understand hash maps. In particular, your idea about "colliding keys and fragmenting" has the rather nasty property that finding an element may be O(1), but proving its absence is O(N). Because every value could end up in every slot, you must check every slot to determine whether that value is in the hash map. To have O(1) lookup even when the lookup fails, every value must hash to at most O(1) keys. – MSalters Oct 19 '16 at 10:14

1 Answer


tl;dr

Is there some special magic in how QHash handles collisions that I am not aware of?

There is no magic. You just misunderstood how hash maps work.

(*) I will treat the concept of a hash map in general, not the specific implementation QHash. And although there are several approaches to handling collisions, what I am describing here is the most common pattern. It is used, among others, by C++'s `std::unordered_map`, Qt's `QHash`, and Java's `HashMap`.


Your terminology with "slots" is a bit confusing. I first thought that by slot you meant "bucket", but I now think you mean an element of a bucket.

So in a hash map, colliding keys are stored in a bucket. This can be any container, from a vector to a list.

Finding a bucket is O(1), and finding a key inside a bucket is O(k), where k is the bucket's length. So key access is constant in the best case and linear in the worst case.
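
To illustrate, here is a stripped-down sketch of such a chained hash map (not QHash's actual implementation). Note that both the success and the failure case only ever inspect a single bucket:

```cpp
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Simplified separate chaining: each bucket holds the key/value pairs
// whose hashes map to that bucket.
template <typename K, typename V>
struct ChainedMap {
    std::vector<std::list<std::pair<K, V>>> buckets;
    explicit ChainedMap(std::size_t bucket_count = 8) : buckets(bucket_count) {}

    const V* find(const K& key) const {
        // O(1): compute the hash and pick the bucket.
        const auto& bucket = buckets[std::hash<K>{}(key) % buckets.size()];
        // O(k): scan only the k entries that collided into this one bucket.
        for (const auto& entry : bucket)
            if (entry.first == key)
                return &entry.second;
        return nullptr;   // absence is decided after inspecting a single bucket
    }
};
```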

You seem to assume that the number of buckets somehow increases when the hash map fills its capacity. Well, there is no such thing as a capacity for a hash map (like there is for a vector, for instance). So the situation that you describe, "having a hash with a capacity of, let's say, 100 where in the worst case all 100 elements collide and are stored in the same bucket", can't happen.

For a hash map there is a measure called the "load factor", which is the average number of elements per bucket (size / bucket_count). The hash will increase the number of buckets (and recompute the hash of every element, repositioning them) when the load factor exceeds a threshold, the max load factor. Performance is assured first and foremost by the quality of the hash function, which must ensure that keys are spread uniformly across all buckets. But no matter how good the hash function is, you can still have situations where some buckets are considerably larger than the rest. The fail-safe for that is the aforementioned max load factor (illustrated by the example after the list below).

As a consequence, the max load factor serves two purposes:

  • it acts as a "logical capacity", if you will: it makes the hash naturally increase the bucket count in the scenario where elements are added to the hash uniformly, making all the buckets too large.

  • it acts as a "fail-safe" for the hash function: it increases the bucket count in the (rare) scenario where multiple keys collide on a small subset of buckets.
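
To see this mechanism in action, here is a small example using C++'s `std::unordered_map`, which exposes the load factor and bucket count directly. The bucket count jumps every time the load factor would exceed the maximum (typically 1.0):

```cpp
#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> map;
    std::cout << "max load factor: " << map.max_load_factor() << '\n';  // typically 1.0

    std::size_t last_buckets = map.bucket_count();
    for (int i = 0; i < 100000; ++i) {
        map[i] = i;
        if (map.bucket_count() != last_buckets) {
            // A rehash happened: more buckets, all elements repositioned.
            std::cout << "size " << map.size()
                      << " -> " << map.bucket_count() << " buckets"
                      << " (load factor " << map.load_factor() << ")\n";
            last_buckets = map.bucket_count();
        }
    }
}
```

The exact bucket counts depend on the standard library implementation, but the pattern is always the same: the number of buckets grows, the elements are redistributed, and the load factor drops back below the threshold.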

bolov
    While your "bucket" concept is a valid approach to collisions, so is rehashing. You *both* misunderstand how hash maps work, by assuming there is only one way to handle collisions. That said, `QHash` specifically is documented to use buckets. – MSalters Oct 19 '16 at 10:06
  • @MSalters aren't we talking about the same thing? When load factor exceeds a threshold rehashing occurs. – bolov Oct 19 '16 at 10:42
  • I guess it depends on whether you see a slot as a bucket with fixed capacity 1. In the simplest of hash tables, you immediately increase the number of slots as soon as you have a collision. Buckets are one way to avoid such aggressive rehashing. Their downside is poor locality of reference. Another is to use N adjacent slots for colliding entries. This too delays rehashing, but now you need to rehash after N+1 collisions for a single insert. – MSalters Oct 19 '16 at 10:48
  • @MSalters Thank you. I haven't given too much thought into it, but it is definitely interesting to see different attempts at handling collisions. – bolov Oct 19 '16 at 10:56