
From the run times of programs that call the hash() operation on variable-length strings, I suspected that the hash() function takes constant time regardless of string length. To verify this assumption I devised the following strategy:

  • Create strings of length k
  • Hash the string and record the hash() operation time t
  • Repeat this for k varying from 0 to 100,000 and generate a plot of string length k vs. time t.

Hence, if my guess that the hash() function is a constant-time operation when operating on strings is correct, can you please explain in layman's terms why that is? A conceptual or theoretical explanation rather than a reference to source code would be preferable: how do even large strings produce a hash almost instantaneously, regardless of character length?

The following is the code implementation of the above-mentioned strategy:

import random
import time
import pylab

def form_str(k, n):
    """
        Form k-length strings of arbitrary characters, n in number.
    """
    for i in range(n):
        random_str = ''
        for j in range(k):
            nextchar = random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz')
            random_str += nextchar
        yield random_str

def time_hash(k, n):
    """
        Find the total time to hash n strings of k length each.
    """
    
    total_time = 0.0
    for element in form_str(k, n):
        start_time = time.time()
        # Because hash() works almost instantaneously we repeat
        # it 100,000 times to get a non-zero run time.
        for i in range(100000):
            hash(element)
        end_time = time.time()
        total_time += (end_time - start_time)
    return round(total_time, 2)

# print(time_hash(100, 100))  

def plot_time():
    """
        Plots the time vs string length (k) over a range of k-values.
    """
    x_values, y_values = [], []
    for k in range(0, 100000, 5000):
        x_values.append(k)
        y_values.append(time_hash(k, 100))
    # print(y_values)
    pylab.figure(1)
    pylab.title('Hash_Time_Complexity')
    pylab.xlabel('Length of string')
    pylab.ylabel('Time taken to hash string (100 x 100000 times)')
    pylab.plot(x_values, y_values, 'r:', label = 'Hash time')
    pylab.show()

plot_time()
# From the plot it is clear that, regardless of the string length,
# hashing takes the same time.

The following is the generated plot: [plot image: time taken to hash vs. string length]

Anirban Chakraborty

1 Answer


Since strings are immutable, the hashcode of a string is computed only once and cached thereafter.
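This caching is easy to observe directly: with a sufficiently large string, the first hash() call is measurably slower than every subsequent one. A minimal sketch (this relies on CPython's caching behavior; absolute timings will vary by machine):

```python
import time

# Build one large string. In CPython, a str object stores its hash
# after the first hash() call, so only that first call pays the
# O(n) cost of scanning all the characters.
s = 'x' * 10_000_000

t0 = time.perf_counter()
hash(s)                      # first call: hash is actually computed
first = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(1000):
    hash(s)                  # later calls: cached value is returned
repeat = (time.perf_counter() - t0) / 1000

print(f'first call (uncached): {first:.6f} s')
print(f'average cached call:   {repeat:.9f} s')
```

On a typical machine the first call takes milliseconds while a cached call takes tens of nanoseconds, which is exactly why the benchmark in the question looked constant-time.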

A better way to benchmark would be to generate different (unique) strings of length k and average their hash times, instead of calling hash() on the same string multiple times.
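A minimal sketch of that corrected benchmark (the function name and parameters here are illustrative, not from the original code): generate n unique strings of length k, then time only the first hash() call on each, so the cache never helps.

```python
import random
import string
import time

def time_first_hash(k, n):
    """Average time of the *first* hash() call over n unique k-length strings."""
    # Build the strings up front so generation time is excluded from the measurement.
    strings = [''.join(random.choices(string.ascii_letters, k=k)) for _ in range(n)]
    start = time.perf_counter()
    for s in strings:
        hash(s)              # each string is hashed for the first time here
    return (time.perf_counter() - start) / n

for k in (1_000, 10_000, 100_000):
    print(k, time_first_hash(k, 200))
```

Measured this way, the time should grow roughly linearly with k, since the hash of a fresh string must read every character once.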

Anmol Singh Jaggi
    Read the implementation here: https://svn.python.org/projects/python/trunk/Objects/stringobject.c: `if (a->ob_shash != -1) return a->ob_shash;` – M Z Sep 02 '20 at 06:24
  • @Anmol Singh Jaggi referring to your comment I would like to have one further clarification. When you say that the hash of a given string is cached I would just like to know if it is internally calculated already when the string object is formed? Thus hash() does not do any new calculation to find the hash; just ends up returning one that is already calculated and stored. – Anirban Chakraborty Sep 03 '20 at 03:32
  • 1
    No the hash is computed lazily when its called for the first time. – Anmol Singh Jaggi Sep 03 '20 at 06:19