2

I'm learning data structures and algorithms. I was taught that almost every operation on a hashmap is O(1). This is a method that performs quadratic probing collision resolution, and it returns the position of the desired object.

private int findPos( Object x )
{
    int offset = 1;
    int currentPos = myhash( x ); //Using quadratic probing 
    
    while( array[ currentPos ] != null &&
            !array[ currentPos ].element.equals( x ) )
    {
        currentPos += offset;  // Compute ith probe: offsets are the odd numbers 1, 3, 5, ...
        offset += 2;           // ...which sum to i^2, so the ith probe lands at myhash(x) + i^2
        if( currentPos >= array.length )
            currentPos -= array.length;
    }
    
    return currentPos;
}

I was doing a homework assignment in which I had to find out the complexity of this method. This is the scenario:

The array represents an English dictionary where each element is a word from it. The answer to this homework is O(1) complexity. But I was wondering why it isn't O(n) complexity, since there are a lot of elements inside (about 171,476 words), which will result in a lot of collisions. In other words, how can I be sure, in this scenario, that this method will always be O(1)?

Edit: The question in my homework was: What is the complexity of this method in the worst-case scenario? Is it still O(1)?

Emmanuelle
  • 73
  • 5
  • For each element, there are 2 cases: either there is a collision or there isn't. Consider the probability of both cases to calculate the estimated complexity of insertion for each element. Finally, take an average over the ```n``` values to find the average insertion time. – Abhinav Mathur Jun 01 '21 at 04:21
  • Most operations on hash tables are not O(1) and are, as you are saying, O(n) or worse. It is their *expected* runtime that is O(1). – ldog Jun 01 '21 at 04:22
  • *"there are a lot of elements inside (about 171,476 words) which will result in a lot of collision"* - the chance of collision doesn't depend on the number of elements, it depends on the load factor. As long as the underlying array has enough space for the elements then catastrophic collisions (i.e. a series of collisions resulting in O(n) behaviour) are extremely unlikely. – kaya3 Jun 01 '21 at 07:41
  • @ldog So, as you said, they are O(1) expected time, not O(1) worst-case time. Just like quicksort is O(n log n) expected time but O(n^2) worst-case time. There is no default assumption that we're measuring the worst case unless we say otherwise. – kaya3 Jun 01 '21 at 07:43

2 Answers

2

Non-worst case: it is O(1) rather than O(n) because the search time does not depend on the size of the array. Yes, you'll get more collisions with more elements, but not more collisions per search on average, because the size of the array increases as n increases, which keeps the load factor bounded.
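To illustrate that point, here is a minimal sketch (my own, not code from the question) of the kind of growth policy this relies on: whenever an insertion would push the load factor past a threshold, the table is enlarged and everything is rehashed, so the expected number of probes per search stays bounded no matter how large n gets. The names (GrowingHashTable, MAX_LOAD, findPosLinear) are illustrative, and linear probing is used only to keep the sketch short.

public class GrowingHashTable
{
    private static final double MAX_LOAD = 0.5;   // assumed threshold
    private Object[] array = new Object[ 11 ];
    private int size = 0;

    private int myhash( Object x )
    {
        return ( x.hashCode() & 0x7fffffff ) % array.length;  // non-negative index
    }

    public void insert( Object x )
    {
        if( ( size + 1.0 ) / array.length > MAX_LOAD )
            rehash();                              // grow before the table gets crowded
        int pos = findPosLinear( x );
        if( array[ pos ] == null )
        {
            array[ pos ] = x;
            size++;
        }
    }

    private void rehash()
    {
        Object[] old = array;
        array = new Object[ old.length * 2 + 1 ];  // roughly double the table
        size = 0;
        for( Object item : old )
            if( item != null )
                insert( item );                    // re-insert under the new size
    }

    private int findPosLinear( Object x )
    {
        int pos = myhash( x );
        while( array[ pos ] != null && !array[ pos ].equals( x ) )
            pos = ( pos + 1 ) % array.length;      // linear probing for brevity
        return pos;
    }
}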

EDIT: Since the question stipulates that the data consists of English words, there is a limit on the largest possible value of n. This means you can practically guarantee no collisions for your hash function, which would make the answer O(1) even in the worst case.

Bradley Thomas
  • 4,060
  • 6
  • 33
  • 55
2

What is the complexity of this method in the worst-case scenario? Is it still O(1) even in the worst-case scenario?

To answer your question, you really have to define what you mean by worst case. The performance of a hash table depends heavily on two factors: the (probability) distribution of the data you intend to store in it, and the hash function you use. Knowing these two factors, you can do an average-case analysis. Without these two pieces of information, you can assume the worst, for example:

  • you will always get the same value to store, say the number 1
  • your hash function is f(x) = x

In this case, if you are using linked-lists for buckets, your worst case is indeed O(n).
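As a concrete sketch of that worst case (my own illustration, not code from the question or a real library): if every key hashes to the same bucket, a chaining table degenerates into a single linked list, and a lookup has to walk all n entries.

import java.util.LinkedList;

// Minimal chaining table with the degenerate setup described above:
// every stored value is the same (say 1) and the hash is f(x) = x,
// so every entry collides into one bucket and lookups scan n entries.
public class DegenerateChainingTable
{
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public DegenerateChainingTable( int numBuckets )
    {
        buckets = new LinkedList[ numBuckets ];
        for( int i = 0; i < numBuckets; i++ )
            buckets[ i ] = new LinkedList<>();
    }

    private int hash( int x )
    {
        return x % buckets.length;   // f(x) = x, folded into the table size
    }

    public void insert( int x )
    {
        buckets[ hash( x ) ].add( x );
    }

    public boolean contains( int x )
    {
        // Worst case: walks the entire list of n colliding entries -- O(n).
        return buckets[ hash( x ) ].contains( x );
    }
}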

If, however, you have more information, such as that your data follows a uniform distribution and you use a sensible hash function, you can show that in the average case most operations are O(1). In the previous sentence, the descriptor "average" is precisely defined by averaging over the probability distribution your data follows.


I'm not clear how detailed an analysis you require, but typically with open addressing (it doesn't matter which probing scheme) the analysis of the average runtime goes as follows.

Let a = n / N be the load factor of the hash table, where n is the number of elements stored and N is the number of buckets. Assuming no clustering problems (due to the data or the hash function) and that all probes are equally likely (the proof that this holds for quadratic probing is separate), you can assert

P(probe hits occupied bucket) = a
P(probe hits unoccupied bucket) = 1-a
P(probe hits unoccupied bucket in 2 steps) = a (1-a)
P(probe hits unoccupied bucket in k steps) = a^{k-1} (1-a)

therefore the average runtime of probing is

E(number of steps in probe) = \sum {for k = 0 to m} k a^{k-1}(1-a)
                            <=\sum {for k = 0 to infty} k a^{k-1}(1-a)
                            = (1-a) / (1-a)^2
                            = 1 / (1 - a)

where the infinite sum is given by the arithmetico-geometric series, and m is some upper bound on the number of steps in the probe (dependent on the probing technique; it usually requires a separate proof that the probing technique terminates in a finite number of steps m).

If we maintain a reasonable load factor a, say a = n / N = .5, then

E(number of steps in probe) <= 1 / (1-.5) = 2

thus E(number of steps in probe) is O(1).

The crux of the entire analysis rests in being able to recognize that the infinite sum has a nice closed form that converges.
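For completeness, the closed form comes from differentiating the geometric series (a standard identity, not specific to hashing):

\sum {for k = 0 to infty} a^k = 1 / (1-a)             (geometric series, |a| < 1)

differentiating both sides with respect to a gives

\sum {for k = 1 to infty} k a^{k-1} = 1 / (1-a)^2

and multiplying both sides by (1-a) recovers the bound used above:

\sum {for k = 1 to infty} k a^{k-1} (1-a) = 1 / (1-a)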

To recap, the proof I gave above holds in the general case, but requires the following conditions on the probing technique you are using (which may need to be proved separately):

  • the probing technique yields a uniform probability, based on the load factor, of hitting an unoccupied/occupied bucket (i.e. P(probe hits occupied bucket) = a)
  • the probing technique terminates in a finite number of steps

Depending on how detailed your analysis must be, you may have to prove these two properties of quadratic probing to complete the proof. (Note that quadratic probing is guaranteed to terminate only if the load factor is a < 0.5.)
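If you do need to argue termination, the standard result is that when the table size is prime and the load factor is below 0.5, the first tableSize/2 quadratic probe positions are all distinct, so at least one of them must be an empty slot. Here is a small, self-contained check of that distinctness claim (hypothetical values, not from the question):

import java.util.HashSet;
import java.util.Set;

// Verifies that the first p/2 quadratic probe positions are distinct
// for a prime table size p, so probing terminates whenever a < 0.5.
public class QuadraticProbeCheck
{
    public static void main( String[] args )
    {
        int p = 10007;                    // prime table size (illustrative)
        int h = 1234;                     // arbitrary starting hash position
        Set<Integer> visited = new HashSet<>();
        for( int i = 0; i < p / 2; i++ )
        {
            int pos = (int) ( ( h + (long) i * i ) % p );
            visited.add( pos );
        }
        System.out.println( visited.size() == p / 2 );   // prints true
    }
}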

ldog
  • 11,707
  • 10
  • 54
  • 70
  • So, to clarify with my example: the data used in the array are words from a dictionary. Is the answer O(1) (for my homework) because the distribution of my data (words from a dictionary) is uniform? – Emmanuelle Jun 01 '21 at 04:40
  • Is there more information about the hash function used? It can have a very large effect on the analysis of the performance. – ldog Jun 01 '21 at 04:43
  • It uses the hashCode() method from Java, which uses Horner's algorithm, which I guess is a good hash function – Emmanuelle Jun 01 '21 at 05:07
  • Okay, I've updated the answer to reflect all the information you've given so far. Depending on what you actually need to show, YMMV, but I've at least outlined the skeleton of the typical proof of constant runtime. – ldog Jun 01 '21 at 05:13
  • So, because I'm using quadratic probing with a load factor of 0.5, that satisfies the "yields a uniform probability" and the "terminates in a finite number of steps" conditions. So then, on average runtime, it is impossible that the method findPos() might be O(n) even in the worst-case scenario (in your example it converges to 2), is that correct? – Emmanuelle Jun 01 '21 at 05:42
  • Yes, that is correct; the proof I showed guarantees O(1) in the worst average case. – ldog Jun 01 '21 at 06:37