
I am looking for verification on two different but related arguments-- the one above (A) and the one below (B) the first line-comment here in the question.

(A) The way HashMap is structured is:

A HashMap is a plain table. That's direct memory access (DMA).

The whole idea behind HashMap (or hashing in general) in the first place is to put this constant-time memory access to use for

a.) accessing records by their own data content (<K,V>), not by their locations in DMA (the table index)

b.) managing a variable number of records-- a number of records that is not fixed in advance and may or may not remain constant throughout the use of this structure.

So, the overall structure in a Java Hash is:

a table: table // I'm using the identifier used in HashMap

each cell of this table is a bucket.

Each bucket is a linked list of type Entry-- i.e., each node of this linked list (not the LinkedList of the Java API, but the data structure) is of type Entry, which in turn is a <K,V> pair.
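
For concreteness, an Entry node looks roughly like this (my sketch modeled on the pre-Java-8 OpenJDK HashMap.Entry; the exact fields and modifiers may differ between versions):

static class Entry<K, V> {
    final K key;
    V value;
    Entry<K, V> next;   // the next node in this bucket's linked list
    int hash;           // the cached, "normalized" hash of the key

    Entry(int hash, K key, V value, Entry<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}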

When a new pair comes in to be added to the hash, a hashCode is calculated for its key. This hashCode is the key to the index of this <K,V> in table-- it tells which bucket this <K,V> will go into in the hash. Note: the hashCode is "normalized" through the function hash() (in HashMap, for one) to better fit the current length of table. indexFor() is also used to determine which bucket, i.e., which cell of table, the <K,V> will go into.

When the bucket is determined, the <K,V> is added to the beginning of the linked list in this bucket-- as a result, it is the first <K,V> entry in this bucket, and the first entry of the linked list that already existed is now the "next" entry pointed to by this newly added one.
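
In code terms, the add path I just described looks roughly like this (my simplified sketch along the lines of the pre-Java-8 OpenJDK put()/addEntry(); table, hash() and indexFor() are the identifiers used there, everything else is stripped down):

// Simplified sketch of the add path described above (modeled on pre-Java-8 HashMap).
// The real put() also handles the null key and replaces the value if the key already exists.
public V put(K key, V value) {
    int hash = hash(key.hashCode());                 // "normalize" the key's hashCode
    int bucketIndex = indexFor(hash, table.length);  // which cell of table, i.e. which bucket

    // prepend the new entry: it becomes the head of the bucket's linked list,
    // and the old head becomes its "next"
    Entry<K, V> oldHead = table[bucketIndex];
    table[bucketIndex] = new Entry<K, V>(hash, key, value, oldHead);
    size++;
    return null;
}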

//===============================================================

(B) From what I see in HashMap, the resizing of the table-- of the hash-- is only done upon a decision based on the hash's size and capacity, which are the current and maximum # of entries in the entire hash.

There is no restructuring or resizing based on individual bucket sizes-- nothing like "resize() when the max. # of entries in a bucket exceeds such-and-such".

It is not probable, but it is possible, that a significant number of entries get bulked up in one bucket while the rest of the hash is pretty much empty.

If this is the case, i.e., there is no upper limit on the size of each bucket, then the hash gives not constant but linear access-- theoretically, for one thing. It takes $O(n)$ time to get hold of an entry in the hash, where $n$ is the total number of entries. But then, it shouldn't be that way.
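
For reference, the resize decision I am referring to looks roughly like this in the OpenJDK HashMap (my sketch of addEntry(); the check uses only the overall entry count against threshold, i.e. capacity times load factor-- individual bucket lengths are never inspected):

// Sketch of the resize decision (modeled on pre-Java-8 HashMap.addEntry()).
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K, V> oldHead = table[bucketIndex];
    table[bucketIndex] = new Entry<K, V>(hash, key, value, oldHead);
    if (size++ >= threshold)         // overall size only-- no per-bucket counter anywhere
        resize(2 * table.length);    // double the table and rehash all entries
}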

//===============================================================

I don't think I'm missing anything in Part (A) above.

I'm not entirely sure of Part (B). It is a significant issue and I'm looking to find out how accurate this argument is.

I'm looking for verification on both parts.

Thanks in advance.

//===============================================================

EDIT:

Fixing a maximum bucket size, i.e., restructuring the hash whenever the # of entries in a bucket hits a maximum, would resolve it-- the access time becomes plain constant, in theory and in use.

This isn't a well-structured fix, but a quick one, and it would work just fine for the sake of constant access.
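
As a sketch, the quick fix I have in mind would be something like this (hypothetical code, not what HashMap actually does; MAX_BUCKET_SIZE and bucketSize() are made-up names for illustration):

// Hypothetical variation of addEntry() with a per-bucket limit-- NOT real HashMap code.
void addEntry(int hash, K key, V value, int bucketIndex) {
    table[bucketIndex] = new Entry<K, V>(hash, key, value, table[bucketIndex]);
    size++;
    if (size >= threshold || bucketSize(bucketIndex) > MAX_BUCKET_SIZE)
        resize(2 * table.length);    // restructure as soon as any bucket gets too long
}

// walk the bucket's linked list to count its entries (illustrative helper)
int bucketSize(int bucketIndex) {
    int count = 0;
    for (Entry<K, V> e = table[bucketIndex]; e != null; e = e.next)
        count++;
    return count;
}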

The hashCodes are likely to be evenly distributed throughout the buckets, and it isn't so likely that any one of the buckets will hit the bucket-max before the threshold on the overall size of the hash is hit. This is the assumption the current setup of HashMap is using as well.

Also based on Peter Lawrey's discussion below.

Roam

2 Answers


Collisions in HashMap are only a problem in pathological cases such as denial of service attacks.

In Java 7, you can change the hashing strategy such that an external party cannot predict your hashing algo.

AFAIK, in Java 8 HashMap will use a tree instead of a linked list for collisions on String keys. This means O(log N) worst case instead of O(n) access times.
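
For example, distinct keys can legally share a hashCode, and such keys always end up in the same bucket no matter how large the table is (a small demo with real String values):

// Distinct String keys with identical hashCodes-- they necessarily share a bucket.
public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());    // 2112
        System.out.println("BB".hashCode());    // 2112, same as "Aa"
        System.out.println("AaAa".hashCode());  // 2031744
        System.out.println("BBBB".hashCode());  // 2031744, same as "AaAa"
        // A map filled only with such colliding keys degrades every lookup in that
        // bucket to a linear scan (pre-Java-8), which is what hash-flooding
        // denial of service attacks exploit.
    }
}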

Peter Lawrey
  • How exactly do you define a collision-- too-big bucket sizes or multiple values for keys? HashMap isn't worrying about either of these cases. – Roam Aug 01 '13 at 20:47
  • I define a collision as multiple keys being placed in the same bucket. HashMap assumes you have a decent hashCode() algo and you have control over your key values. This is usually true, but not always. – Peter Lawrey Aug 01 '13 at 20:49
  • I see your point about Java 8. This explains it, and the "constant-time access" of the Java hash in current implementations is nothing but a myth. – Roam Aug 01 '13 at 20:50
  • Worst case timings are only really a problem for denial of service attacks, i.e. where you don't control the keys and someone is deliberately trying to slow you down. If this is not the case you can assume near constant time. – Peter Lawrey Aug 01 '13 at 20:51
  • A hashCode algorithm would either have to resize the hash based on the max. bucket size, or base its which-bucket decision also on the fullness of the buckets. – Roam Aug 01 '13 at 20:52
  • That won't help if every key has the same hashCode. e.g. http://stackoverflow.com/questions/2310498/why-doesnt-strings-hashcode-cache-0 – Peter Lawrey Aug 01 '13 at 20:52
  • I'd agree with the expected time complexity being constant or "near-constant". The possibility of the buckets being extremely unbalanced is very low. But theoretically, the access time is not constant. – Roam Aug 01 '13 at 20:54
  • True, I would describe it as O(log log N) based on actual observation. Even if you have a perfect hash like Integer.hashCode() you don't have so many buckets that you can prevent collisions. – Peter Lawrey Aug 01 '13 at 20:55
  • Please also see my "EDIT" for a quick, if not well-structured, fix. This would achieve constant-time access without the tree structure of Java 8. – Roam Aug 01 '13 at 21:06
  • @Roam your EDIT assumes unlimited memory. – zapl Aug 01 '13 at 21:13
  • I'm putting a limit on the bucket sizes, on top of a limit on the hash size. If you're referring to more frequent increases in the table size, that's something else, but not unlimited. – Roam Aug 01 '13 at 21:19
  • Actually, the memory is linear in the # of pairs coming in with my fix, while it isn't in the current setup. In the extreme case, all entries can be in one bucket and the hash keeps being resized-- while the # of empty buckets is doubled each time. – Roam Aug 01 '13 at 21:23
  • @Roam I think I don't really understand you here but: The table is an array of buckets. The same hash goes to the same bucket. Determining the bucket is O(1). Finding the right element inside the bucket is where you get the slowdown. And if you limit the size of a bucket, you limit the number of elements that may have the same hash. If everything has the same hash, you are guaranteed to end up slower than constant time. – zapl Aug 01 '13 at 21:32
  • @zapl I see your point. I'm looking to increase the table size when "everything is in the same hash". The hash-to-bucket mapping changes when the size of the table does-- see indexFor(). The elements can still have whichever hash they want. I'm just dispersing them more evenly-- into shorter buckets in larger tables. – Roam Aug 01 '13 at 21:44
  • @zapl ...this is at the cost of memory space. It is very rare to achieve an improvement in time without taking advantage of memory. But, with the assumption that HashMap itself is using, it is nowhere near a drastic memory usage and is never infinite. – Roam Aug 01 '13 at 21:46
  • @Roam See http://en.wikipedia.org/wiki/Hash_table - If you create 1000 objects where `hashCode()` returns 42 on each, then you have to have one bucket in your table that holds 1000 elements. It does not matter what size the table is. The function that determines the bucket index from an object's `hashCode()` and the table size *must* put every element into the same bucket if `hashCode()` returns the same value, or you would no longer have a hash map. – zapl Aug 01 '13 at 22:00
  • That's exactly what's spoiling "hash provides constant access". You're looking at my argument from some other angle. – Roam Aug 01 '13 at 23:41

I'm looking to increase the table size when everything is in the same hash. The hash-to-bucket mapping changes when the size of the table does.

Your idea sounds good. And it is completely true and basically what HashMap does when the table size is smaller than desired / the average number of elements per bucket gets too large. It's just not doing that by looking at each bucket and checking whether there is too much in there, because the overall average is easier to calculate.

The implementation of HashMap.get() in OpenJDK according to this is

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

That shows pretty well how HashMap finds elements, but it's written in a fairly confusing way. After a bit of renaming, commenting and rewriting it could look roughly like this:

public V get(Object key) {
    if (key == null)
        return getForNullKey();

    // get the key's hash & try to improve the distribution.
    // -> this can map every 42 that goes in to a 9,
    //    but it can't map it to a 9 one time and to an 8 another time
    int hash = hash(key.hashCode());

    // calculate bucket index, same hash must result in same index as well
    // since table length is fixed at this point.
    int bucketIndex = indexFor(hash, table.length);
    // we have just found the right bucket. O(1) so far.
    // and this is the whole point of hash based lookup:
    // instantly knowing the nearly exact position where to find the element.


    // next, see if the key is found in the bucket -> get the list in the bucket
    LinkedList<Entry<K, V>> bucketContentList = table[bucketIndex];

    // check each element, in the worst case O(n) time if everything is in this bucket.
    for (Entry<K, V> entry : bucketContentList) {
        if (key.equals(entry.key))   // key is non-null here, so this order avoids an NPE
            return entry.value;
    }
    return null;
}

What we see here is that the bucket index indeed depends on both the .hashCode() returned by each key object and the current table size. And it will usually change when the table size changes-- but only for keys whose .hashCode() values differ; keys with identical .hashCode() values always end up in the same bucket as each other.

If you had an enormous table with 2^32 buckets you could simply say bucketIndex = key.hashCode() and it would be as perfect as it can get. There is unfortunately not enough memory to do that, so you have to use fewer buckets and map 2^32 possible hashes into just a few buckets. That's what indexFor essentially does: mapping a large number space into a small one.
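
In the source linked above, that mapping is essentially a single bit mask (the table length is always a power of two):

// indexFor() as in pre-Java-8 OpenJDK HashMap: masking with (length - 1)
// keeps only the low bits of the hash as the bucket index.
static int indexFor(int h, int length) {
    return h & (length - 1);
}

Different hashCodes can still collapse to the same index, but identical hashCodes always do.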

That is perfectly fine in the typical case where (almost) no object has the same .hashCode() as any other. But the one thing that you must not do with HashMaps is to add only elements with exactly the same hash.

If every hash is the same, your hash-based lookup results in the same bucket, and all your HashMap has become is a LinkedList (or whatever data structure holds the elements of a bucket). And now you have the worst-case scenario of O(N) access time, because you have to iterate over all N elements.

zapl