1

There is a contradiction about whether the insert operation on dict() or the add operation on set() is O(n) or O(1), where n is the length of the string key.

Suppose we have strings which vary in length, i.e. n_1, n_2, ..., n_x. Then the time complexity of performing the following:

s = set()
d = dict()
for x in N:  # where N = [n_1, n_2, ..., n_x]
    s.add(x)
    d[x] = 1

is O(len(N) * Z), where Z = len(n_1) + len(n_2) + ... + len(n_x). If we assume that add or insert is an O(1) operation, then the time complexity will be O(len(N)).

Is the above true?

From http://svn.python.org/projects/python/trunk/Objects/stringobject.c we see that the computation of the hash depends on the length of the string, which is what I am assuming len is below:

static long string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}

Here (Efficiency of long (str) keys in python dictionary) someone showed that varying the length of the string does not affect the time of computing a hash. But this contradicts the code above.

From the following link (Get dictionary keys hashes without recalculation) we know that the hash value, once computed, is stored in the object. This means that lookup will be constant time, O(1). However, the insertion/adding, which is when the hash is computed, should be linear in the length of the string.
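
As a rough illustration of both points, here is a minimal sketch, assuming CPython's behaviour of caching str hashes (random strings are used only so that no hash is cached beforehand, and the exact timings are machine-dependent):

import random
import string
import time

def first_and_second_hash(length):
    # Build a fresh string so that no hash has been cached for it yet.
    s = ''.join(random.choices(string.ascii_letters, k=length))
    t0 = time.perf_counter()
    hash(s)   # first call: walks the whole string, O(len(s))
    t1 = time.perf_counter()
    hash(s)   # second call: returns the cached hash, O(1)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

for length in (1_000, 100_000, 10_000_000):
    first, second = first_and_second_hash(length)
    print(f'len={length:>10}: first={first:.6f}s second={second:.6f}s')

The first call should scale with the length of the string, while the second should stay flat regardless of length.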

  • The size of each string is *independent* of the number of items already in the set. `len(N) * Z` consists of two *different* `len` functions: one that computes the length of the set, and another that computes the length of a string; and the second one is *constant* in terms of the number of items in the set. – chepner May 02 '19 at 14:40
  • Wait, suppose N is the input supplied to the function and N varies; then it cannot be considered constant, can it? – block-ch May 02 '19 at 14:47
  • Nothing about the *algorithm* assumes or cares if input will be `N`; if you provide such an input, that's a *coincidence*, not something that affects the *asymptotic* worst-case running time of the operation. – chepner May 02 '19 at 14:58

2 Answers

3

There are a gazillion things on which the performance of insert depends. The calculation of the hash function is indeed O(k) for a string of length k, but it is just uninteresting in the general case.

If you consider string keys of only 8 bytes in length, there are 18446744073709551616 different combinations; since 8 is a constant, the calculation of the hash for an 8-byte key is O(8), which is O(1).

But even with 18446744073709551616 items, an insertion into the hash table could still take about 1 µs. For a list, on the other hand, where insertion at the beginning is O(n), even if inserting/copying one item at the end of the list took only one nanosecond, inserting at the beginning of a list of that many items could take 585 years.

OTOH, while it is conceivable that you might have a collection of 4294967296 or even 18446744073709551616 items, if you've got a key of 4294967296 or 18446744073709551616 bytes in your hash table, you seriously need to rethink your architecture.
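
A rough sketch of that difference (illustrative only; timings of a single operation are noisy and machine-dependent, and the key format is just something made up here to keep all keys the same length): with fixed-length keys, inserting into a dict stays roughly flat as the collection grows, while inserting at the front of a list grows linearly with the number of items.

import time

def time_dict_insert(n):
    # Insert one more fixed-length (20-byte) key into a dict of n items.
    d = {'k%019d' % i: i for i in range(n)}
    t0 = time.perf_counter()
    d['k' + '9' * 19] = -1        # same key length as all the others
    return time.perf_counter() - t0

def time_list_front_insert(n):
    # Insert one item at the front of a list of n items.
    lst = list(range(n))
    t0 = time.perf_counter()
    lst.insert(0, -1)             # O(n): every existing item has to shift
    return time.perf_counter() - t0

for n in (10_000, 100_000, 1_000_000):
    print(n, time_dict_insert(n), time_list_front_insert(n))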

  • But then why is it that in here the hash computation appears to be O(1): https://stackoverflow.com/questions/28150047/efficiency-of-long-str-keys-in-python-dictionary – block-ch May 02 '19 at 14:33
  • Good point. Of course it is case of Martijn being wrong (again). The hash code is cached on the string. – Antti Haapala -- Слава Україні May 02 '19 at 14:36
  • Sure, the hash is cached on the string once it has been computed in O(k) time. Others have also been saying that computing a hash in a hash table is independent of the length of the string, i.e. some fixed number of bits is used in the computation: https://stackoverflow.com/questions/2771368/can-hash-tables-really-be-o1 . I tried to disprove that by showing the Python source code of the hash function implementation. The link could actually be right if Python is not using that hash function but some other one, hence my confusion over everything. – block-ch May 02 '19 at 14:41
  • Thanks. So, here https://wiki.python.org/moin/TimeComplexity in the dict section, set item is said to be O(1). It is O(1) if we are setting a value for a key whose hash has already been computed, but if all the keys are different and are being hashed for the first time, the time complexity is O(k), where k is the length of each string, which could vary. Is that correct? – block-ch May 02 '19 at 14:56
  • It is O(1) as a function of the size of the collection. – Antti Haapala -- Слава Україні Mar 06 '21 at 15:34
0

I added a new answer on the question that you linked, and I link it again here: https://stackoverflow.com/a/70252226/800337

To justify why it's O(N) in the length of the key, and why a cached hash value doesn't really matter, imagine this scenario...

You have a filesystem full of files, with an average size of 30 Kbytes each. You want to find duplicate files, and your strategy is to read files one at a time, and do something like this:

with open(filespec,'rb') as f:
    contents = f.read()
# If file contents are new, insert the contents as a key
#  and the filespec as the value
# If file contents are a duplicate, don't change the
#  dictionary, and return the first filespec where we
#  saw these contents.
spec = filedict.setdefault(contents, filespec)
if spec != filespec:
    print(f'{filespec} is a duplicate of {spec}')

I'm not saying this is the best way to find duplicate files; there are clearly better ways. But let's analyze it anyway.

For each file that we read in, the call to the setdefault() method does need to hash the file contents. This is because we didn't get the content bytes object from someplace where it was already hashed; we got it from a file read. (If we were looking up the same file content bytes in multiple dictionaries, then yes, the second and subsequent lookups could do the hashing in O(1), but in this use case, that doesn't apply.)

So lookup of a newly generated bytes object is O(N) just because we haven't cached a hash value yet.

Now assume that the lookup actually hits in the dictionary -- we've seen a file before with the same contents. The lookup isn't over just because we found an item in the right hash bucket -- the implementation is now going to do a bytes comparison of the new bytes object against the bytes object in the dictionary. Because they match, the entire bytes object will be compared -- all 30 Kbytes or whatever.

So for this use case, finding a duplicate file results in two operations which are O(N) with respect to the length of the files -- after reading the file, we need to hash it, and on lookup, we need to do a complete byte comparison of the new key and the existing key.

A key takeaway from this is that if you're looking up a lengthy key which is already in the dictionary, you're going to do that O(N) comparison of the key being looked up against the key that's in the dictionary -- unless the two keys are the same exact object, in which case Python is generally smart enough to trivially decide that if (a is b) then (a == b).
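
Here is a small sketch of that last point (illustrative only; key and copy are names introduced just for this example, and the 10 MB size is arbitrary). Both lookups hit the same bucket with the same cached hash, but the identical object can short-circuit on identity, while the equal-but-distinct object forces a full comparison of the contents:

import time

key = 'x' * 10_000_000           # one long key (~10 MB of characters)
copy = 'x' * 10_000_000          # equal contents, but a different object
d = {key: True}

hash(copy)                        # pre-compute copy's hash so only the comparison is timed

t0 = time.perf_counter()
d[key]                            # same object: identity check avoids comparing contents
t1 = time.perf_counter()
d[copy]                           # equal object: full O(N) comparison of the contents
t2 = time.perf_counter()

print(f'lookup by the same object:      {t1 - t0:.6f}s')
print(f'lookup by an equal, new object: {t2 - t1:.6f}s')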

However, as implied by the timing statistics I gave on the other linked answer, you can probably assume that for small keys up to about 1K in length, the O(N) contribution to hashing and key comparison is small enough to be ignored. It doesn't really start to show up in a noticeable way until the key lengths start to climb significantly above 1K in size.

Jack Brennen