Compare strings in a list to another list of strings: pairwise string comparison vs check existence in set

Question

I'm comparing a list of strings words_to_lookup to a list of strings word_list and for every match I'll do something.

I wonder whether direct string comparison or checking existence in a set will be more time efficient.

String comparison is O(n)

for w in word_list:
    for s in words_to_lookup:
        if s == w:
            # do something

Checking existence in a set is O(1) but getting the hash of the string (probably) takes O(n).

w_set = set(word_list)
for s in words_to_lookup:
    if s in w_set:
        # do something

So is string comparison an efficient approach? If the set approach is faster, how?

EDIT:

I was not thinking clearly when I first posted the question. Thank you for the comments. I find it hard to convert my real problem to a concise one suited for online discussions. Now I made the edit, the answer is obvious that the first approach is O(n^2) and the second approach is O(n).

My real question should have been this: Can hash tables really be O(1)?.

And the answer is:

The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.

In your second code snippet, it will always be true, because you are seeing if the string is in the list where it came from. Did you mean to check if it was in a different list ...? — Larry the Llama, Jan 13 '22 at 20:19
What do you find if you measure the execution times for different inputs? — mkrieger1, Jan 13 '22 at 20:24
Also bear in mind that converting a list to a set will be O(n). — match, Jan 13 '22 at 20:24
Your requirements are a bit confusing. If you have a list and you want to do something for **every** match, converting to a set will ruin it because it will appear only once in the set. The second code doesn't make much logical sense, if you bother to loop the list, then why make a set — shriakhilc, Jan 13 '22 at 20:27
@LarrytheLlama Sorry for the mistake. It should be checking if s is in w_set. I've fixed it. — johan, Jan 13 '22 at 20:36
After this edit, I don't see the utility of `w`: it is not used. In every iteration of the loop the `if` condition will evaluate to the same outcome. So the line with `for` can be removed. What is `s`? Is it a constant? Both code snippets seem complicated ways to check `s in word_list`... — trincot, Jan 13 '22 at 20:46
@trincot s is a string. I'm doing other work for *each* match of s in word_list, which is why there is a for loop. `w` is indeed not used in the second code, maybe I should named it `_`. — johan, Jan 13 '22 at 20:57
Renaming does not help, the `if` condition is not impacted by the loop, so it just does the exact same thing over and over again. You may want to add the logic you have for `s` in the code snippets. It seems relevant. — trincot, Jan 13 '22 at 21:00
@trincot Thank you so much. I found where I confused myself and made another edit, which almost changed the entire question... Sorry about that. — johan, Jan 13 '22 at 21:29
It seems there is no more question, as the question also has the answer, taken from another question of which yours is then a duplicate. — trincot, Jan 13 '22 at 21:41

dvr · Answer 1 · 2022-01-13T21:46:44.910

1

EDIT Ok. Now I see what you mean. In that case, the set version is definitely better. Note, you can also do:

for s in words_to_lookup:
    if s in word_list:
       # do something

That's the same thing as your set way, but the running time of the "in" operator will be worse.

list - Average: O(n)
set/dict - Average: O(1), Worst: O(n)

So the set way is probably best.

edited Jan 13 '22 at 21:46

answered Jan 13 '22 at 20:29

dvr

370
1
9

Sorry for the mistake. It should be checking if s is in w_set. I've fixed it. – johan Jan 13 '22 at 20:37
But w_set is build from world list, so it will always be true. Perhaps post more code to give us some context as to what you are doing. – dvr Jan 13 '22 at 20:44
w in wset is always true but s in wset not necessarily. `s` is a string that might have matches in word_list, which is what I'm checking. – johan Jan 13 '22 at 20:50
I found where I confused myself and made another edit, which almost changed the entire question... Sorry about that. – johan Jan 13 '22 at 21:33
I see what you mean. Does my answer help you now? – dvr Jan 13 '22 at 21:46
Yes, thank you! – johan Jan 13 '22 at 21:48
Cool. note that with the set way, you do o(n) up front which results in total time complexity of 2*O(n). However, in the average case with the list lookup, you do O(n) with each iteration, so it's effectively a short hand for your nested loop version, which is O(n^2) – dvr Jan 13 '22 at 21:53

score 1 · Answer 2 · answered Jan 13 '22 at 20:45

If you only need to search once, your first approach is more efficient.

One advantage of constructing a set is the following: if you need to search against the same set many times, you only need to build the set once.

In other words, suppose you have N words in the dictionary (dictionary_list) and you have a list of M words that you want to look up (words_to_lookup). If you go with the set approach, the complexity is O(N+M). If you don't build a set, the complexity is O(N*M) because you may have to go over the whole dictionary of N words for each of the M words that you are looking up.

For this problem, the following code is the more efficient approach.

w_set = set(dictionary_list)
for w in words_to_lookup:
    if w in w_set:
        # do something

Compare strings in a list to another list of strings: pairwise string comparison vs check existence in set

2 Answers2