0

I am learning HashSet-s and I can't quite understand how they build their tables. Every article I read talks about % hash set function, which seems to imply that should I add several small numbers (ints) to a HasSet (say 1, 7, 14) and then add a large one (say 2732780) - the table for the HashSet will immediately grow to an enormous size with tons of empty cells. Am I correct? If not, in what form are numbers/elements stored then? Where can I read about this in detail?

UPDATE: So to clarify: say my hash function is x = (input / 10), y = (input % 10), this means in my example above, while you are adding 1, 7 and 14 the HashSet build the table like so: enter image description here

The question is, what will happen if I add 2732780? The table should explode right? And have tons of empty numbers?

FZs
  • 16,581
  • 13
  • 41
  • 50
cubrman
  • 884
  • 1
  • 9
  • 22
  • Which programming language and specific class (include the namespace/library) are you talking about? – Progman Aug 16 '20 at 06:50
  • Does this answer your question? [How does a Java HashMap handle different objects with the same hash code?](https://stackoverflow.com/questions/6493605/how-does-a-java-hashmap-handle-different-objects-with-the-same-hash-code) – Progman Aug 16 '20 at 06:59
  • No, I am not interested in what happens when two objects with the same hash code are added. I updated the question. – cubrman Aug 16 '20 at 07:11
  • Your example shows two functions, but there should be only one function for the "hash function". Please [edit] your example to have only one function as the hash function (most likely its `input % 10`). – Progman Aug 16 '20 at 07:20
  • @Progman the purpose of the Stackoverflow platform is to be publicly wrong and to be publicly corrected. Precisely this happened - I won't edit anything. – cubrman Aug 16 '20 at 08:11

2 Answers2

3

Hashsets use one hash function. This turns any datatype into an int. This int is then placed into a bucket. This is done by using % with the number of available buckets.

Say that it's x % 10. Then it'll look like this:

|input   |hashcode    |bucket
|1       |     1      |1
|7       |     7      |7
|14      |     14     |4
|2732780 |    2732780 |0

In the case that there are two entries with the same bucket, then a list is kept and the lookup is linear within that bucket. The implementation will grow buckets as needed.

See https://learn.microsoft.com/en-us/dotnet/api/system.object.gethashcode?view=netcore-3.1#System_Object_GetHashCode

Jason
  • 1,505
  • 5
  • 9
  • Marking your answer as an answer because it's easier to understand. – cubrman Aug 16 '20 at 08:10
  • I think you have an error in your answer: hashed for 14 should be 4 and for 2732780 should be 0, the bucket values should be the same as input. Am I wrong? – cubrman Aug 16 '20 at 08:22
  • I wanted to change it to distinguish between hashcode and bucket. The hashcode for an int is itself. The bucket size is determined by the number of elements in the set to save space. But I will update the column to hashcode and there was an error when I moved 14 – Jason Aug 16 '20 at 08:27
1

In Java a HashSet is implemented using a HashMap (from an algorithm point of view, there's no reason why this should be different in C#).

This article explains how a HashMap works quite well: https://www.geeksforgeeks.org/internal-working-of-hashmap-java/

The key point is that there's a notion of map capacity. This is an array holding the map buckets. The computation which index to choose is:

index = hashCode(key) & (n-1)

with n being the capacity. This means that as you insert more entries into the set / map, the implementation will grow and re-hash its entries.

aeberhart
  • 744
  • 1
  • 4
  • 15