I've implemented binary search, linear search and a hash table to compare their time complexities. The problem is that, somehow, my hash table is much slower than the binary search when I measure the time it takes to find prime numbers. Below is my code:

// Make the hash table 20 times the number of prime numbers
HashTable::HashTable(std::vector<int> primes)
{
    int tablesize = primes.size() * 20;
    table = new std::list<int>[tablesize];
    size = tablesize;
    for (auto &prime : primes)
        this->insert(prime);
}

// Hash function
int HashTable::hash(int key)
{
    return key % size;
}

// Finds element
int HashTable::find(int key)
{
    // Get index from hash
    int index = hash(key);

    // Find element
    std::list<int>::iterator foundelement = std::find(table[index].begin(), table[index].end(), key);


    // If element has been found return index
    // If not, return -1
    if (foundelement != table[index].end())
        return index;
    else
        return -1;
}



// Adds element to hashtable
void HashTable::insert(int element)
{
    // Get index from hash and insert the element
    int index = hash(element);
    table[index].push_back(element);
}

HashTable.h

#ifndef HASHTABLE_H
#define HASHTABLE_H

#include <list>
#include <iostream>
#include <vector>

class HashTable 
{
private:
    // Each position in Hashtable has an array of lists to store elements in case of collision
    std::list<int>* table;

    // Size of hashtable
    int size;

    // Hashfunction that returns the array location for the given key
    int hash(int key);

public:

    HashTable(int tablesize);
    HashTable(std::vector<int> primes);

    // Adds element to hashtable
    void insert(int element);

    // Deletes an element by key 
    void remove(int key);

    // Returns an element from hashtable for a given key
    int find(int key);

    // Displays the hashtable
    void printTable();

    // Display histogram to illustrate elements distribution
    void printHistogram();

    // Returns the number of lists in hash table
    int getSize();

    // Returns the total number of elements in hash table
    int getNumberOfItems();

    // De-allocates all memory used for the Hash Table.
    ~HashTable();
};

#endif

I've already tried increasing the table size to eliminate collisions, but I didn't notice any difference.

This is the result (graph of the measured search times):

  • That's a very nice graph. Looks like one would expect: the hash search has a constant time complexity, and the binary search has a logarithmic one. It's just that the constant of the hash table is quite large. Vectors play very well with caches. – Petr Skocik Dec 06 '15 at 22:00
  • Probably unrelated to the benchmark, but the constructor should accept `primes` by reference – M.M Dec 06 '15 at 22:06
  • What happens to your timings if you change the type of `table` to `std::vector<int>*`? – clstrfsck Dec 06 '15 at 22:06
  • Yeah, but I just use the vector to insert the elements. I only measure the time to find the element. The container used in the table is `std::list<int>* table`. – user5310508 Dec 06 '15 at 22:08
  • Well, if I change `table` to a vector, I can't handle any collisions; that's why I use an array of lists. – user5310508 Dec 06 '15 at 22:09
  • What if you change `key % size` to `key % 12345`, I mean, hard-code the count? Will it be faster? I think division may be a bit too slow. (Btw, it's a bad kind of hash function generally, unless the divisor is a prime number.) Also, do you compile your code with optimizations on or off? – Alexey Frunze Dec 06 '15 at 22:44
  • Use `std::vector<int>*`; if only searching matters, sort the collision lists. And take a `const&` of `table[index]`. – Jerome Dec 06 '15 at 22:52
  • Where's your benchmarking code? – ysdx Dec 06 '15 at 23:10
  • `int tablesize = primes.size() * 20;` - that's a *lot* of wasted space (and, consequently, time) – Karoly Horvath Dec 06 '15 at 23:13
  • I believe @msandiford's suggestion was to use an array of vectors rather than an array of lists. All you do with the elements in the array is to search and append; `vector` should outperform `list` for that. – Alan Stokes Dec 06 '15 at 23:14
  • I love to see questions like these. :') – erip Dec 06 '15 at 23:24
  • This is a fairly poor hashtable implementation, as the comments above explain. So it has a very high time constant. However, it is a proper hashtable, so its performance is roughly constant even as the number of nodes increases. Unless the number of elements is very large, it will be beaten out by well-written binary search. Did you benchmark `std::unordered_map` or `std::unordered_set`? – David Schwartz Dec 23 '15 at 19:59

3 Answers

Some things that are suboptimal with the hash table implementation:

  • `primes.size() * 20` is excessive - you'll get a lot more cache misses than necessary; try multipliers between 1 and ~2 to find the optimal point

  • `primes.size() * 20` is always even, and all the prime numbers you hash with `key % size` are odd, so you never hash into half the buckets, wasting space and degrading cache performance

  • you handle collisions with linked lists: that means you're always following at least one pointer away from the table's contiguous memory, which is slow, and for each colliding node you jump around in memory again; using `std::vector<int>` to store colliding values caps the jumping at one memory area outside the hash table, or you could use closed hashing / open addressing with displacement lists, which typically finds the element in a nearby hash table bucket - my benchmarks have found that around an order of magnitude faster for similar `int` values (see the sketch below)
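
To make the first three points concrete, here is a minimal sketch of a chained table with `std::vector` buckets and a tighter, odd bucket count; the class name, the roughly 1.3-buckets-per-key sizing and the sample hashing are illustrative assumptions, not the original code:

#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal chained hash table for ints using vector buckets.
// Sizing and hashing choices are illustrative, not definitive.
class IntHashTable
{
    std::vector<std::vector<int>> buckets;

public:
    explicit IntHashTable(const std::vector<int>& keys)
    {
        // ~1.3 buckets per key instead of 20, forced to an odd count
        // so odd keys (every prime except 2) can reach every bucket.
        std::size_t n = keys.size() + keys.size() / 3;
        buckets.resize(n | 1);
        for (int k : keys)
            buckets[k % buckets.size()].push_back(k);
    }

    bool contains(int key) const
    {
        // One jump to the bucket's contiguous storage, then a linear scan.
        const std::vector<int>& b = buckets[key % buckets.size()];
        return std::find(b.begin(), b.end(), key) != b.end();
    }
};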

Tony Delroy

If your data is completely random, it may be difficult to find a good constant for the modulo operation. If your data follows some sort of pattern, you may want to try running through a bunch of candidate constants to see which one performs best on your data.

In this post I showed how such a large-scale test could be structured. In the end my hash table produced an average lookup in 1.5 comparisons with a worst case of 14. The table contained 16000 entries, roughly 2^14.
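
Below is a rough sketch of how such a test over candidate constants might be structured; the candidate values and the "longest bucket" scoring are assumptions for illustration, not the test from the linked post:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// For each candidate modulus, report the longest chain it would produce
// for the given keys; a smaller worst case means a more even distribution.
static std::size_t longestBucket(const std::vector<int>& keys, std::size_t mod)
{
    std::vector<std::size_t> counts(mod, 0);
    std::size_t worst = 0;
    for (int k : keys)
        worst = std::max(worst, ++counts[k % mod]);
    return worst;
}

int main()
{
    std::vector<int> keys = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}; // sample data
    for (std::size_t mod : {16381u, 16384u, 20011u, 32768u})      // candidate constants
        std::printf("mod %zu -> longest bucket %zu\n", mod, longestBucket(keys, mod));
}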

Olof Forshell

It's all about complexity: binary search is O(log n), while the scan inside a bucket in your find is linear, so it is O(n) in the chain length and gets worse when you have a lot of collisions.
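
As a tiny illustration of that point (not code from the answer), here is the degenerate case: with a single bucket every key collides, and each lookup becomes a linear scan of the whole list.

#include <algorithm>
#include <list>
#include <vector>

int main()
{
    // Table of size 1: every key hashes to bucket 0, so a lookup
    // degenerates into an O(n) scan of the entire chain.
    std::vector<std::list<int>> table(1);
    for (int k : {2, 3, 5, 7, 11, 13})
        table[k % table.size()].push_back(k);

    const std::list<int>& bucket = table[13 % table.size()];
    bool found = std::find(bucket.begin(), bucket.end(), 13) != bucket.end();
    return found ? 0 : 1;
}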

Jerome