
Obviously, in Python to check whether two strings are equal you can do:

"hello word" == "hello world"

But what if you are comparing really long strings (in excess of 1 million characters)? Is there a built-in way or any libraries in Python that can do this much faster, perhaps utilizing the Karp–Rabin algorithm or something similar?

Or, under the hood, is `stringA == stringB` actually the fastest method?

pablowilks2
  • No modern programming language compares strings character by character. The comparison is just as fast regardless of how long the strings are. – Guy Incognito May 16 '20 at 09:03
  • Actually, I think Python keeps a hash of each string, in which case it presumably compares the hashes first when doing a string comparison. That should make it fast even for very long strings. Although it would still have to do a character-by-character comparison if the hashes match. – Tom Karzes May 16 '20 at 09:03
  • Why is an *algorithm* required for this? Simply compare using `==`, `!=` or `is` (they return bool values). – de_classified May 16 '20 at 09:04
  • @GuyIncognito That's not true. If two strings are equal, but not the same object, then the only way to verify it is to compare every character of both strings. That will take much longer for long strings. And for languages that rely on null-terminated strings, without a length count, string comparisons will be slow for long strings. – Tom Karzes May 16 '20 at 09:11
  • @RoadRunner Of course, the hashes are functions of the string values, so they have to match. That makes inequality fast in the vast majority of cases. But when two hashes are the same, then it has to fall back on an actual string comparison. – Tom Karzes May 16 '20 at 09:13
  • An algorithm is required because the OP wants to compare very long strings and is concerned that the algorithm/method implemented in `==` may not be especially fast. – innisfree May 16 '20 at 09:13
  • Have you profiled the time taken? E.g. how long does it take to compare strings of length 10, 100, 1000, etc. using `==`? How in practice does the time scale with the length of the string? I'd be interested in strings that are different (giving false) and ones that are the same (giving true). The latter may be slower. – innisfree May 16 '20 at 09:16
  • This will be useful to read: [what happens when you compare two strings in python](https://stackoverflow.com/questions/43466106/what-happens-when-you-compare-two-strings-in-python) – RoadRunner May 16 '20 at 09:36
  • @TomKarzes There is a length check fast-path, but I [don't see any hash check](https://github.com/python/cpython/blob/62ecd8a8f908282726d2f019c93efa1cf2e9e784/Objects/unicodeobject.c#L11159-L11178). IIRC the hash of strings is calculated and stored at time of first use, so it could be that a hash is not necessarily available to use for equality comparisons if there was no prior reason to hash the string. – wim May 17 '20 at 17:01
  • Regular strings are not several million elements long. What do these strings *represent*? Most importantly, what are the inherent rules of the data that can be exploited? Do you expect strings to be predominantly equal or unequal? Do you expect inequality rather at the start or end, or at other specific positions? Do you expect minor (a few chars) or major (some thousand chars) differences? Do you compare the same string multiple times (e.g. pairwise from a list)? Do you need to compare only individual pairs, or many? What is your goal? Eliminate duplicates, find matches, ...? – MisterMiyagi May 17 '20 at 17:07
  • @wim That's interesting. You'd think it would use the hash value when it's available. – Tom Karzes May 17 '20 at 19:10
  • @TomKarzes Maybe the cost of checking whether or not a hash is available to use isn't worth the potential gains (I guess you would only get significant gains for inequality rejection on long strings, which is unlikely to be a typical use case). – wim May 17 '20 at 19:25
  • @wim Well, long string comparison in general is probably uncommon. The check would make the biggest difference when comparing lots of long strings with a high mismatch rate. – Tom Karzes May 17 '20 at 19:51
  • @wim Of course, in that case you'd be better off keeping the strings in a `dict`, which would make it a hash-based search. – Tom Karzes May 17 '20 at 19:57

1 Answer



Considering how str == str is implemented in Python, the comparison first performs an identity (id()) check and a length check, and only then compares element by element. This is quite fast, and understandably so, since a lot of Python code relies on it. In the average case there is no need for further optimization, because arbitrary unequal strings will typically differ quite early.
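
As a rough illustration (a Python-level sketch of that logic, not the actual C implementation; the function name is mine), the checks happen in roughly this order:

def str_eq_sketch(a, b):
    # identity fast path: the same object is trivially equal to itself
    if a is b:
        return True
    # length fast path: strings of different lengths cannot be equal
    if len(a) != len(b):
        return False
    # only then compare element by element (in C this is essentially a memcmp)
    return all(x == y for x, y in zip(a, b))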

However, there are two use cases where there is some room for optimization:

  • you have some partial information on how the two inputs are going to be different.
  • you perform multiple comparisons among a certain set of elements (see @wim's comments).

An example of the first situation: if you know that when two inputs of the same size differ, they are likely to differ over at least m contiguous elements, then comparing only every k-th element, with k < m, is a reasonable bet, e.g.:

def is_k_equal(a, b, k=4096):
    # compare only every k-th element; k in {0, 1} degenerates to a full comparison
    if k in {0, 1}:
        return a == b
    else:
        return a[::k] == b[::k]


def is_equal_partial(a, b, partial_match=is_k_equal):
    # cheap length check, then the sampled check, and only then the full comparison
    return len(a) == len(b) and partial_match(a, b) and a == b
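
As a quick illustration with made-up inputs, the sampled comparison can reject inputs that differ over a long contiguous run without touching most of the data, while a difference confined to a single element still falls through to the full a == b:

a = b'x' * 1_000_000
b = a[:500_000] + b'y' * 500_000   # differs over a long contiguous run
c = a[:-1] + b'y'                  # differs only in the last element

print(is_equal_partial(a, b))                 # False, rejected by a[::k] == b[::k]
print(is_equal_partial(a, c))                 # False, but only after the full a == b
print(is_equal_partial(a, b'x' * 1_000_000))  # True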

An example of the second situation: if you want to know which p inputs out of q are pairwise equal, it may be beneficial to compute a hash of your inputs (for example with hash(), but other options may be equally valid) and only perform a full comparison when the hashes match. It goes without saying that if your hash has a high collision rate, it may just introduce additional overhead (see Wikipedia for information on hashing). The hashes of the inputs could be managed manually, or you could guard your full comparison with a hash comparison inside an is_equal() function, e.g.:

def is_equal_hashing(a, b, hashing=hash):
    # length check, then hash check; the full comparison runs only if both pass
    return len(a) == len(b) and hashing(a) == hashing(b) and a == b

provided that your hashing function is memoized. For hash() you do not need to do anything else, as it is already memoized for these inputs. If you were to use a fancier hash (e.g. crc32, md5, etc.), you may need to add the memoization yourself, e.g. with @functools.lru_cache (a sketch of this is given below, after the condition). The condition for this use case to see benefits from this approach (ignoring the time required to compare the hashes, which is usually much faster than the other operations considered) is:

t_h * n_h < t_c * n_c

with t_h the initial hash computation time, n_h the number of unique hashes to compute, t_c the comparison time, and n_c the number of full comparisons which fail near the end of the inputs.
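
For example (a hedged sketch, not from the original benchmarks: crc32_memo and is_equal_crc32 are names I made up), memoization with @functools.lru_cache and a back-of-the-envelope check of the condition above, plugging in the micro-benchmark figures reported further below, could look like:

import functools
import zlib


@functools.lru_cache(maxsize=None)
def crc32_memo(data):
    # cached, so repeated comparisons involving the same input do not recompute the hash
    return zlib.crc32(data)


def is_equal_crc32(a, b):
    # cheap guards first; the full comparison only runs when the hashes match
    return len(a) == len(b) and crc32_memo(a) == crc32_memo(b) and a == b


# Back-of-the-envelope check of t_h * n_h < t_c * n_c, using hash() ~586 µs (first call)
# and a worst-case full comparison ~142 µs from the timings below, for a hypothetical
# workload of 100 unique inputs and 450 full comparisons that would fail near the end.
t_h, t_c = 586e-6, 142e-6
n_h, n_c = 100, 450
print(t_h * n_h < t_c * n_c)  # True: hashing would pay off for this particular workload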

When in doubt on how things will perform on your input, it is typically a good idea to measure / profile your code.

Care must be taken when timing memoized functions (like hash()): if you are interested in the performance of the non-memoized path, you cannot rely on timing many repeated calls on the same input, as is typically done, for example, by IPython's %timeit with its default parameters. Instead, you may use %timeit -n1 -r1 for un-cached timings. The results are then only useful as order-of-magnitude estimates.
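
For instance, in an IPython session (assuming s is the ~1 MB bytes object constructed in the benchmarks below, whose hash is cached after the first call on the CPython version used here), the two invocations measure very different things:

%timeit hash(s)          # default looping: mostly measures the memoized (cached) path
%timeit -n1 -r1 hash(s)  # single run: closer to the cost of the first, un-cached call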

To give you an idea of how fast the possible ingredients of your approach are, here are some micro-benchmarks:

import hashlib
import functools


def md5(data):
    return hashlib.md5(data).digest()


@functools.lru_cache(maxsize=16384)
def sha1(data):
    return hashlib.sha1(data).digest()


def sha256(data):
    return hashlib.sha256(data).digest()


def sha512(data):
    return hashlib.sha512(data).digest()


import numpy as np
import numba as nb


@nb.jit(fastmath=True)
def hash_sum_nb(data, init=0):
    dtype = np.uint64
    nbytes = 8
    n = len(data)
    offset = n % nbytes
    result = init
    if offset:
        body = np.frombuffer(data[:-offset], dtype=dtype)
        tail = np.frombuffer(data[-offset:], dtype=np.uint8)
        for x in tail:
            result += x
    else:
        body = np.frombuffer(data, dtype=dtype)
    for x in body:
        result += x
    return result + n


import zlib
import string
import random


n = 1_000_000
s = b''.join(string.printable[random.randrange(len(string.printable))].encode() for _ in range(n))
funcs = hash, hash, zlib.crc32, zlib.adler32, md5, sha1, sha1, sha256, sha512, hash_sum_nb
for func in funcs:
    result = %timeit -n1 -r1 -q -o func(s)
    print(f'{func.__name__:>12s}  {result.best * 1e6:.3f} µs')
#         hash  586.401 µs
#         hash  0.853 µs
#        crc32  976.128 µs
#      adler32  468.452 µs
#          md5  1790.659 µs
#         sha1  1362.948 µs
#         sha1  1.423 µs
#       sha256  1347.432 µs
#       sha512  1321.981 µs
#  hash_sum_nb  64.768 µs


cases = {
    'worst case': (s[:-1] + b'x', s[:-1] + b'y'),
    'best case*': (s[:-1], s[:-2]),
    'best case': (b'x' + s[:-1], b'y' + s[:-1]),
}
for name, (a, b) in cases.items(): 
    result = %timeit -n1 -r1 -q -o a == b
    print(f'str == str ({name:10s})  {result.best * 1e6:.3f} µs')
# str == str (worst case)  142.466 µs
# str == str (best case*)  0.856 µs
# str == str (best case )  1.012 µs


a, b = (s[:-1] + b'x', s[:-1] + b'y')
result = %timeit -n1 -r1 -q -o is_k_equal(a, b)
print(f'{is_k_equal.__name__:>12s}  {result.best * 1e6:.3f} µs')
# is_k_equal  10.037 µs

Note that both hash() and sha1() are called twice on the same input to show the effects of memoization.

With this data (or similar numbers that you could produce for your input / system), it may be possible to craft a more performant string equality comparison. Note that throughout the answer I used bytes rather than str: the timings for str would generally be worse for most of the hashing functions, because of the additional overhead required to handle the encoding, with the notable exception of hash().
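
Purely as an illustrative sketch of how these ingredients might be combined (the function name and the thresholds below are my own guesses, not part of the benchmarks, and would need tuning on your actual data):

def is_equal_combined(a, b, k=4096, hashing=hash, min_len=65536):
    # cheap structural guards first
    if a is b:
        return True
    if len(a) != len(b):
        return False
    # for long inputs, try to reject early with sampling and (memoized) hashing
    if len(a) >= min_len:
        if a[::k] != b[::k]:
            return False
        if hashing(a) != hashing(b):
            return False
    # fall back to the plain (and already fast) full comparison
    return a == b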


norok2
  • Are your tests for when the strings are indeed equal? Is there much difference when they are different? – innisfree May 16 '20 at 10:31
  • @innisfree I have added some more insights on the input. As you can see, the tests mostly exercise the case where the two strings are not equal. If you were to test only equal strings, the `len()` and `hash()` guarding would just make things slower, of course. – norok2 May 16 '20 at 10:36
  • Got it. So if you had some information about how plausible it was or how frequently the strings were identical, you could use it to minimise the expected run time. – innisfree May 16 '20 at 10:42
  • Whether the builtin hash helps or not depends on whether you want to compare the same obj to multiple others, or just once. Because strings cache their hash in CPython. – wim May 17 '20 at 18:56
  • @wim The `is_equal_hash()` is ~8 times slower, so calling `hash()` once is going to be still ~4 times slower... not enough to compensate even a single `str == str` call, unless you compare pairwise the same cached strings over and over again, and even then, wouldn't [`functools.lru_cache`](https://docs.python.org/3/library/functools.html#functools.lru_cache) be more effective? – norok2 May 17 '20 at 22:33
  • @norok2 No, the case I'm hinting at is where you want to compare many to many, e.g. if you've already compared a with b and c with d, then you have cached hashes now and you can compare a with c and b with d faster. – wim May 17 '20 at 23:16
  • @wim OK, but that is still a fairly special use case, and will only be beneficial beyond a certain point. For strings with different lengths, this may never be beneficial. – norok2 May 18 '20 at 00:00
  • Agreed. Only mention it because I actually had this exact special use case once in https://stackoverflow.com/q/7646520/674039 – wim May 18 '20 at 00:25