
I implemented a HashTable with a variable number of buckets set at initialization of the class: essentially an array of linked lists sized at runtime.
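For context, a rough sketch of that kind of structure is below (the names and hashing here are illustrative only; the actual implementation is linked at the bottom of the question):

    // Sketch of a bucketed hash table: an array of singly linked nodes, sized at construction.
    public class HashTable<K, V> {
        private static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;

            Node(K key, V value, Node<K, V> next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        private final Node<K, V>[] table;

        @SuppressWarnings("unchecked")
        public HashTable(int buckets) {
            table = (Node<K, V>[]) new Node[buckets]; // bucket count fixed at runtime
        }

        private int indexFor(K key) {
            return (key.hashCode() & 0x7fffffff) % table.length; // non-negative bucket index
        }

        public void put(K key, V value) {
            int index = indexFor(key);
            table[index] = new Node<>(key, value, table[index]); // prepend to the chain (no replace-on-duplicate)
        }

        public V get(K key) {
            for (Node<K, V> node = table[indexFor(key)]; node != null; node = node.next) {
                if (node.key.equals(key)) {
                    return node.value;
                }
            }
            return null;
        }
    }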

The issue is that a table with a small number of buckets, where the linked list must be traversed (chains can reach approx. 5K nodes deep), is outperforming a table with three orders of magnitude more buckets.

    int SMALL_BUCKET_SIZE = 10;
    int BIG_BUCKET_SIZE = 10000;

    HashTable<String, Integer> smallHashTable = new HashTable<>(SMALL_BUCKET_SIZE);
    HashTable<String, Integer> bigHashTable = new HashTable<>(BIG_BUCKET_SIZE);

I would expect the larger HashTable to be close to O(1) for search, while the smaller table, with its higher collision rate, takes more time traversing the linked nodes. Yet my numbers below show the smaller table outperforming the wider one.
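(For scale, assuming an even hash spread: 1,000 keys over 10 buckets is roughly 100 entries per chain, versus 1,000 over 10,000 buckets at about 0.1 per bucket, so big-table lookups should rarely touch more than a node or two.)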

Fetch SmallTable: 0.000007
Fetch BigTable: 0.000018

So I decided to loop my HashTable.get a thousand times to account for JIT and JVM optimization. Now I'm starting to see numbers that seem to confirm what I'd expect.

Fetch SmallTable: 0.0000013630
Fetch BigTable: 0.0000002560

My question is about the soundness of my logic as well as the additional moving parts here. I've pasted my test below, alongside a link to the implementation of the HashTable and underlying Node structures.

I'm looking for feedback from folks with depth and experience here on the variables that factor into this, such as key length, hash collision rates, bucket density, etc.

HashTableTest.java

@Test
public void canInitializeHashTableWithBucketsForPerformance() throws InterruptedException {
    double smallTableTime, bigTableTime;
    int SMALL_BUCKET_SIZE = 10;
    int BIG_BUCKET_SIZE = 10000;

    HashTable<String, Integer> smallHashTable = new HashTable<>(SMALL_BUCKET_SIZE);
    HashTable<String, Integer> bigHashTable = new HashTable<>(BIG_BUCKET_SIZE);
    List<String> strings = generateRandomStringKeys(1000);

    strings.forEach(string -> bigHashTable.put(string, 10));
    strings.forEach(string -> smallHashTable.put(string, 10));

    Consumer<String> bigHashGet = bigHashTable::get;
    Consumer<String> smallHashGet = smallHashTable::get;

    String theString = strings.get(strings.size() - 1);

    smallTableTime = getElapsedTimeFactoringOutJavaOptimization(theString, smallHashGet);
    bigTableTime = getElapsedTimeFactoringOutJavaOptimization(theString, bigHashGet);

    System.out.println(String.format("Fetch SmallTable: %.10f", smallTableTime));
    System.out.println(String.format("Fetch BigTable:   %.10f", bigTableTime));

    assertTrue(smallTableTime > bigTableTime);
}

public double getElapsedTimeFactoringOutJavaOptimization(String s, Consumer<String> aMethod) {
    long start = 0, end = 0;

    for (int i = 0; i < 1000; i++) {
        start = System.nanoTime();
        aMethod.accept(s);
        end = System.nanoTime();
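        // start/end are overwritten on every pass, so only the final iteration's timestamps survive the loop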
    }

    return (end - start) / 1_000_000_000D;
}

public List<String> generateRandomStringKeys(int numOfRandomKeys) {
    List<String> keys = new ArrayList<>();

    for (int i = 0; i < numOfRandomKeys; i++) {
        byte[] array = new byte[10];
        new Random().nextBytes(array);
        keys.add(new String(array, Charset.forName("UTF-8")));
    }

    return keys;
}

The test can be found here - Github - HashTableTest.java

The implementation can be found here as well - Github - HashTable.java

mcnichol
  • Benchmarking an operation like this is going to be harder than just running it 1000 times. (1000 times is nowhere near enough, for starters.) – Louis Wasserman Nov 08 '18 at 23:40
  • You should also average the speed of multiple rounds. Move the `System.nanoTime()` calls outside of the loop – flakes Nov 08 '18 at 23:43
  • @flakes Notice that I'm basically dropping all of the System.nanoTimes on the floor except for the last one. Moving them outside the loop would capture the entire time the loop ran, not a single final iteration. – mcnichol Nov 08 '18 at 23:59
  • @LouisWasserman - I increased the loop. It seems to peek out around 40k - 100k iterations. Curious if you typically see higher than that. – mcnichol Nov 09 '18 at 00:02
  • @mcnichol: please define what you mean by "peek out". – Louis Wasserman Nov 09 '18 at 00:03
  • My mistake, peak as in times are not decreasing and 40 - 100k seem to be the max iterations signaling any change. lexers and stacks on the brain. – mcnichol Nov 09 '18 at 00:16
  • @LouisWasserman Seeing you already mentioned 10MM iterations in the below post. – mcnichol Nov 09 '18 at 00:18
  • @flakes - I averaged the time like you and Louis both mentioned and the times were much more stable, thanks. Have updated the repo, appreciate the feedback. – mcnichol Nov 09 '18 at 01:09
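
For illustration, the averaged approach described in these comments might look something like the sketch below. It keeps the question's Consumer-based shape, but the method name and iteration count are made up, and it is still no substitute for a real harness like JMH (see the answer):

    // Sketch: time all calls in a single nanoTime window and return the average per call, in seconds.
    public double averageElapsedSeconds(String s, Consumer<String> aMethod, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            aMethod.accept(s);
        }
        long end = System.nanoTime();
        return ((end - start) / (double) iterations) / 1_000_000_000D;
    }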

1 Answer

There are many things wrong here, but a handful include:

  • Running this operation 1000 times and taking the difference of nanoTime for each of them does not make your benchmark valid. Seriously, use JMH (a minimal sketch follows this list). Or at least run it, like, ten million times.
  • Your hash table doesn't actually work any differently for different-sized tables. You use `table[getHash(key) % RADIX]`, which basically means however big the table is, you only use 10 buckets in it and pretend the rest don't exist.
  • `System.identityHashCode` is not a useful hash function, especially on strings, especially when you're hoping to actually find an element that's in there...or not.
  • While you're at it, you're not using `Node.next` as a field, and might as well get rid of it.
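
For reference, a minimal JMH sketch along the lines of the first point could look something like this. It assumes the question's HashTable with the constructor, put, and get used in the test (and that get returns the stored value); the annotation choices and key generation are only illustrative:

    import java.nio.charset.StandardCharsets;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class HashTableBenchmark {

        // Mirrors the question's test: 1,000 random 10-byte keys, bucket counts of 10 and 10,000.
        HashTable<String, Integer> smallTable;
        HashTable<String, Integer> bigTable;
        String key;

        @Setup
        public void setUp() {
            smallTable = new HashTable<>(10);
            bigTable = new HashTable<>(10_000);
            Random random = new Random(42);
            for (int i = 0; i < 1_000; i++) {
                byte[] bytes = new byte[10];
                random.nextBytes(bytes);
                key = new String(bytes, StandardCharsets.UTF_8);
                smallTable.put(key, 10);
                bigTable.put(key, 10);
            }
            // key ends up as the last inserted string, matching the lookup in the test
        }

        @Benchmark
        public Integer smallTableGet() {
            return smallTable.get(key); // returning the result keeps the call from being optimized away
        }

        @Benchmark
        public Integer bigTableGet() {
            return bigTable.get(key);
        }
    }

JMH then handles warmup, forking, and averaging across enough iterations that the nanoTime granularity issue raised in the comments goes away.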
Louis Wasserman
  • Point 1: I use only the last value from the final loop. Point 2: Apologies, you were faster on the draw than my push to Github. It was changed to `Node nodeTree = table[getHash(key) % table.length];` Point 3: Would love some clarity on this. For System.identityHashCode I read that this was what Object.hash actually calls underneath. Trying to read up more on this, purely for academic reasons. Point 4: It's possible this is related to the same Push/timing comment I made earlier. I'm using Node.next for recursively iterating through instead of my previous loop for searching. – mcnichol Nov 08 '18 at 23:56
  • 1: Using only the last value from the final loop makes it actually worse. 3: `System.identityHashCode` is only used if the objects do not define a sensible hash function of their own. `String` has a sensible hash code. – Louis Wasserman Nov 08 '18 at 23:57
  • 1: Interested in how it is worse. My reasoning was that warming up values didn't matter and that I could take the final one. Are you saying averaging all values would be a more accurate representation than final value? 3: I had made the key generic and plan on overriding because due to prime 31 being leveraged and looking to test performance on larger primes in reference to a comment made in [another thread](https://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier#comment2856302_299748). Have you played in this space already? – mcnichol Nov 09 '18 at 00:13
  • 1. The granularity of `System.nanoTime` isn't necessarily fine enough to give useful measurements for a single iteration; better to time multiple iterations in one go. Averaging all values would almost certainly be better. 3. If you want to make the key generic, then call the `hashCode` method of whatever the key is -- meaning `key.hashCode()`. – Louis Wasserman Nov 09 '18 at 00:20
  • I'll mark this one as the answer. Would love any additional feedback if you have any more. I know you mentioned there were many things wrong in this one. The main thing I'm walking away with being running JMH instead of looping with the average and dropping System.identity for the objects hash. I've updated it with your feedback, thanks. – mcnichol Nov 09 '18 at 01:03