
From the run times of programs that call the hash() operation on variable-length strings, I suspected that the hash() function takes constant time regardless of string length. To verify this assumption I devised the following strategy:

  • Create strings of length k
  • Hash the string and record the hash() operation time t
  • Repeat this for k varying from 0 to 100,000 and generate a plot of string length k vs. time t.

Hence, if my guess that the hash() function is a constant-time operation when operating on strings is correct, can you please explain in layman's terms why that is? A conceptual or theoretical explanation rather than a reference to source code would be preferable: how do even large strings produce a hash almost instantaneously, regardless of character length?

The following is the code implementation of the above-mentioned strategy:

import random
import time
import pylab

def form_str(k, n):
    """
        Form k-length strings of arbitrary characters, n in number.
    """
    for i in range(n):
        random_str = ''
        for j in range(k):
            nextchar = random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
            random_str += nextchar
        yield random_str

def time_hash(k, n):
    """
        Find the total time to hash n strings of k length each.
    """
    
    total_time = 0.0
    for element in form_str(k, n):
        start_time = time.time()
        # Because hash() works almost instantaneously we repeat
        # it 100,000 times to get a non-zero run time.
        for i in range(100000):
            hash(element)
        end_time = time.time()
        total_time += (end_time - start_time)
    return round(total_time, 2)

# print(time_hash(100, 100))  

def plot_time():
    """
        Plots the time vs string length (k) over a range of k-values.
    """
    x_values, y_values = [], []
    for k in range(0, 100000, 5000):
        x_values.append(k)
        y_values.append(time_hash(k, 100))
    # print(y_values)
    pylab.figure(1)
    pylab.title('Hash_Time_Complexity')
    pylab.xlabel('Length of string')
    pylab.ylabel('Time taken to hash string (100 x 100000 times)')
    pylab.plot(x_values, y_values, 'r:', label = 'Hash time')
    pylab.show()

plot_time()
# From the plot it is clear that, regardless of the string length,
# hashing takes the same time.

The following is the generated plot: [plot image: time taken to hash vs. string length]

Anirban Chakraborty

1 Answer


Since strings are immutable, the hashcode of a string is computed only once and cached thereafter.
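This caching is easy to observe directly: with a sufficiently large string, the first hash() call is measurably slower than every subsequent one. A minimal sketch (this relies on CPython's caching behavior; absolute timings will vary by machine):

```python
import time

# Build one large string. In CPython, a str object stores its hash
# after the first hash() call, so only that first call pays the
# O(n) cost of scanning all the characters.
s = 'x' * 10_000_000

t0 = time.perf_counter()
hash(s)                      # first call: hash is actually computed
first = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(1000):
    hash(s)                  # later calls: cached value is returned
repeat = (time.perf_counter() - t0) / 1000

print(f'first call (uncached): {first:.6f} s')
print(f'average cached call:   {repeat:.9f} s')
```

On a typical machine the first call takes milliseconds while a cached call takes tens of nanoseconds, which is exactly why the benchmark in the question looked constant-time.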

A better way to benchmark would be to generate different (unique) strings of length k and average their hash times, instead of calling hash() on the same string multiple times.
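A minimal sketch of that corrected benchmark (the function name and parameters here are illustrative, not from the original code): generate n unique strings of length k, then time only the first hash() call on each, so the cache never helps.

```python
import random
import string
import time

def time_first_hash(k, n):
    """Average time of the *first* hash() call over n unique k-length strings."""
    # Build the strings up front so generation time is excluded from the measurement.
    strings = [''.join(random.choices(string.ascii_letters, k=k)) for _ in range(n)]
    start = time.perf_counter()
    for s in strings:
        hash(s)              # each string is hashed for the first time here
    return (time.perf_counter() - start) / n

for k in (1_000, 10_000, 100_000):
    print(k, time_first_hash(k, 200))
```

Measured this way, the time should grow roughly linearly with k, since the hash of a fresh string must read every character once.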

Anmol Singh Jaggi
    Read the implementation here: https://svn.python.org/projects/python/trunk/Objects/stringobject.c: `if (a->ob_shash != -1) return a->ob_shash;` – M Z Sep 02 '20 at 06:24
  • @Anmol Singh Jaggi referring to your comment I would like to have one further clarification. When you say that the hash of a given string is cached I would just like to know if it is internally calculated already when the string object is formed? Thus hash() does not do any new calculation to find the hash; just ends up returning one that is already calculated and stored. – Anirban Chakraborty Sep 03 '20 at 03:32
  • 1
    No the hash is computed lazily when its called for the first time. – Anmol Singh Jaggi Sep 03 '20 at 06:19