
I know I don't have a minimal working example, but I am not asking for more help with the code itself; I will summarize as much as possible. The test runs 1000 times, each run trying to insert 50 persons into the table. Each trial randomly generates keys with getRandomPersonalNumber.

The linear probing function returns the number of collisions, updates the index if needed, and checks whether the key already exists at the probed index (a simplified sketch of it is included after the code below). The only thing that seems weird in the results is Table 1. I asked some friends about it, and they said that perhaps the modulo 100 is doing something, and that is why I get the high collision count in Table 1.

I am concerned that it might be something in my calculations, but then again it only happens with modulo 100. Can I calculate the expected number of collisions precisely without relying on the code, with some math perhaps? Lastly, is there a way to calculate a good middle ground between the amount of storage and the amount of collisions (i.e., the load factor)?

typedef struct
{
    struct Bucket *table; /* array of buckets (key/value pairs) */
    int size;             /* number of buckets; used by hash() and insertElement() */
} HashTable;

static int hash(Key key, int tablesize)
{
    return (key % tablesize);
}

/* Inserts all persons and returns the total collision count reported by insertElement. */
static int addPersonsToTable(HashTable *htable, const Person *personArray, int amount)
{
    int collisions = 0, i;
    for (i = 0; i < amount; i++)
    {
        int key = personArray[i].personalNumber;
        collisions += insertElement(htable, key, personArray[i]);
    }
    return collisions;
}

static int getRandomPersonalNumber(void)
{
    int day = rand() % 30 + 1;   /* 1..30  */
    int month = rand() % 12 + 1; /* 1..12  */
    int year = rand() % 60 + 40; /* 40..99 */
    return day + 100 * month + 10000 * year; /* YYMMDD packed into a single integer */
}

int insertElement(HashTable *htable, const Key key, const Value value)
{
    int coll;
    int index = hash(key, htable->size);
    coll = linearProbe(htable, key, &index);
    if (coll == 0 || index > -1)
    {
        htable->table[index].key = key;
        htable->table[index].value = value;
    }
    /* else: no free slot was found (index == -1), the element is not inserted */

    return coll;
}
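
For reference, linearProbe does roughly the following. This is a simplified sketch of its behaviour, not my exact code, and it assumes a key of 0 marks an empty bucket:

/* Sketch: counts occupied slots holding a different key, leaves *index at the
   first free slot or at a slot that already holds this key, and sets *index
   to -1 if the table is full. */
static int linearProbe(const HashTable *htable, Key key, int *index)
{
    int collisions = 0, step;
    for (step = 0; step < htable->size; step++)
    {
        int probe = (*index + step) % htable->size;
        if (htable->table[probe].key == 0 || htable->table[probe].key == key)
        {
            *index = probe;
            return collisions;
        }
        collisions++;
    }
    *index = -1; /* no free slot found */
    return collisions;
}

insertElement above then writes the key/value pair whenever *index is valid.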

Table test.

-- Summary ----------------------
Average collisions on 1000 runs. Inserted 50 persons.
Table 1 (size 100) - average number of collisions: 516 - load factor: 0.50
Table 2 (size 150) - average number of collisions: 26 - load factor: 0.33
Table 3 (size 200) - average number of collisions: 68 - load factor: 0.25
Table 4 (size 250) - average number of collisions: 12 - load factor: 0.20
Table 5 (size 300) - average number of collisions: 26 - load factor: 0.17
Table 6 (size 350) - average number of collisions: 7 - load factor: 0.14
Table 7 (size 400) - average number of collisions: 16 - load factor: 0.13
Table 8 (size 450) - average number of collisions: 5 - load factor: 0.11
----------------------------------

1 Answer


This is a natural result of how hash tables work; even if your keys are highly unique, once they are mapped into a small range of possible slots there will be a substantial number of collisions. The load factor is simply usedSpace/availableSpace, and even a perfectly random hash function will produce collisions at these occupancies.
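
To put a number on the "can I calculate it with math" part of your question: under a perfectly uniform hash, the expected number of keys (out of n) whose home slot is already occupied when they arrive is n - m(1 - (1 - 1/m)^n) for a table with m slots. Here is a small sketch of that calculation; note that your counter will typically report more, because linear probing can count several probe steps per insert, and the size-100 table does not satisfy the uniform-hash assumption (see the edit below):

#include <math.h>
#include <stdio.h>

/* Expected number of keys (out of n) whose home slot is already occupied,
   assuming a perfectly uniform hash over m slots:
   E = n - m * (1 - (1 - 1/m)^n) */
static double expectedCollisions(int n, int m)
{
    double emptyFraction = pow(1.0 - 1.0 / m, n);      /* expected fraction of slots still empty */
    double occupiedSlots = m * (1.0 - emptyFraction);  /* expected number of distinct slots used */
    return n - occupiedSlots;
}

int main(void)
{
    int sizes[] = {100, 150, 200, 250, 300, 350, 400, 450};
    int i;
    for (i = 0; i < 8; i++)
        printf("size %3d: ~%.1f expected first-probe collisions for 50 keys\n",
               sizes[i], expectedCollisions(50, sizes[i]));
    return 0;
}

For 50 keys this gives roughly 10.5 collisions at size 100 and about 7.4 at size 150 as the uniform-hash baseline.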

That said, you seem to be under the mistaken impression that your 0.5 load factor is inefficient. That load factor is perfectly fine; Java's HashMap, for example, uses a default load factor of 0.75. Linear searches over a handful of elements are very cheap, so there is no real performance penalty as long as the number of elements in each hash location stays low.

What matters more than the total number of hash collisions is whether many collisions pile up in the same place, i.e. whether your hash is random enough. The more elements are stuffed into a single hash location, the more linear your search time becomes; this is why guarantees on your hash function are more important than guarantees on the raw collision count.

EDIT: while the above is correct, the abnormally high collision count for the table of size 100 (which I misread at first: oops) is due to the modulo being a factor of the month and year multipliers; see the comment below.
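
To make that concrete, here is a small check, assuming the key layout from the question (day + 100*month + 10000*year). With a table size of 100, key % 100 is always equal to day, so at most 30 of the 100 buckets can ever be hit:

#include <stdio.h>
#include <stdlib.h>

/* Counts how many distinct buckets key % 100 actually reaches when the keys
   are generated with the question's day/month/year layout. */
int main(void)
{
    int used[100] = {0};
    int distinct = 0, i;

    for (i = 0; i < 100000; i++)
    {
        int day = rand() % 30 + 1;
        int month = rand() % 12 + 1;
        int year = rand() % 60 + 40;
        int key = day + 100 * month + 10000 * year;
        int bucket = key % 100;  /* always equals day for this layout */
        if (!used[bucket])
        {
            used[bucket] = 1;
            distinct++;
        }
    }
    printf("distinct buckets hit with table size 100: %d\n", distinct); /* prints 30 */
    return 0;
}

Cramming 50 keys into at most 30 effective buckets is what pushes the collision count for Table 1 so far above the other sizes.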

  • What would you say is the optimal table size for 50 elements? Should I do more tests, and where do I draw the line on the load factor vs. the size of the table? On the other matter, I think you are right: it is not random enough, which makes the hash keys very similar. – jacob May 29 '21 at 22:06
  • @jacob my apologies; I misunderstood the problem you were having. The reason the collisions are so high on a table size of 100 is because the `month` and `year` terms are multiples of 100 and 10000, respectively, and mod 100 of either of those is 0! That means that only `day` is contributing to your hash location, so your effective table size is only 30! To fix this, you should multiply `month` and `year` by prime numbers. – dominicm00 May 29 '21 at 22:17
  • A minimal perfect hash function would take exactly 50 spots, but usually a maximum load factor of 2/3 to 3/4 (or 1/e) is what is used in practice. I've heard that you can have more in linear probing. Your focus should probably be on making the hash function more random. – Neil May 29 '21 at 22:20
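
A sketch of the prime-multiplier fix suggested in the comment above; the specific primes (101 and 10007) are illustrative choices, not values given in the thread:

/* Variant of the question's key generator that multiplies month and year by
   primes instead of 100 and 10000, so both still contribute to key % 100.
   (Sketch only; the original YYMMDD-style encoding is lost this way.) */
static int getRandomPersonalNumberPrime(void)
{
    int day = rand() % 30 + 1;   /* 1..30  */
    int month = rand() % 12 + 1; /* 1..12  */
    int year = rand() % 60 + 40; /* 40..99 */
    return day + 101 * month + 10007 * year;
}

With these multipliers, key % 100 becomes (day + month + 7 * year) % 100, so all three fields influence the bucket. Keeping the original encoding and choosing a prime table size instead (for example 101 rather than 100) is another common way to avoid this kind of pattern.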