For a "pure" in/not in check, use set.intersection. Creating a set of the bigger text (if you can hold it in memory) will speed up this task tremendously.
A set reduces the words to be checked to the unique ones, and each membership check is O(1) on average, which is about as fast as you can get:
from urllib.request import urlopen

# use the copy on disk if present, else fetch it once from the URL and save it for later runs
try:
    with open("faust.txt", encoding="utf-8", errors="replace") as f:
        data = f.read()
except FileNotFoundError:
    # partial credit: https://stackoverflow.com/a/46124819/7505395
    # get some free text - Goethe's Faust should suffice
    url = "https://archive.org/stream/fausttragedy00goetuoft/fausttragedy00goetuoft_djvu.txt"
    raw = urlopen(url).read()
    with open("faust.txt", "wb") as f:
        f.write(raw)
    # urlopen returns bytes; decode so that data is a str in both branches
    data = raw.decode("utf-8", errors="replace")
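Before running this on the full text, the idea can be sketched on a toy example. The sentence and the `wanted` words below are made up for illustration, and `isdisjoint` is shown only as a possible alternative for a yes/no answer; it is not part of the measurements in this answer.

```python
# Toy stand-in for the large text; the words are invented for illustration.
text = "the quick brown fox jumps over the lazy dog the end"
words = text.split()   # list: a membership test scans element by element
unique = set(words)    # set: a membership test is a hash lookup

wanted = {"fox", "cat"}

# Full intersection: which of the wanted words occur at all?
found = wanted.intersection(unique)

# Pure yes/no check: is at least one wanted word present?
any_present = any(w in unique for w in wanted)

# isdisjoint answers the opposite question without building a result set.
none_present = wanted.isdisjoint(unique)

print(found, any_present, none_present)
```

If only a yes/no answer is needed, the `any(...)` or `isdisjoint` form avoids building the intersection; if you need to know which words matched, `intersection` is the one to use.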
Process the data for measurements:
words = data.split() # words: 202915
unique = set(words) # distinct words: 34809
none_true = {"NoWayThatsInIt_1", "NoWayThatsInIt_2", "NoWayThatsInIt_3", "NoWayThatsInIt_4"}
one_true = none_true | {"foul"}
# should use timeit for this, haven't got it here
def sloppy_time_measure(f, text):
    import time
    print(text, end="")
    t = time.time()
    # execute function 1000 times
    for _ in range(1000):
        f()
    print((time.time() - t) * 1000, "milliseconds")
# .intersection calculates _full_ intersection, not only an "in" check:
lw = len(words)
ls = len(unique)
sloppy_time_measure(lambda: none_true.intersection(words), f"Find none in list of {lw} words: ")
sloppy_time_measure(lambda: one_true.intersection(words), f"Find one in list of {lw} words: ")
sloppy_time_measure(lambda: any(w in words for w in none_true),
                    f"Find none using 'in' in list of {lw} words: ")
sloppy_time_measure(lambda: none_true.intersection(unique), f"Find none in set of {ls} uniques: ")
sloppy_time_measure(lambda: one_true.intersection(unique), f"Find one in set of {ls} uniques: ")
sloppy_time_measure(lambda: any(w in unique for w in one_true),
                    f"Find one using 'in' in set of {ls} uniques: ")
Outputs for 1000 applications of the search (added spacing for clarity):
# in list
Find none in list of 202921 words: 5038.942813873291 milliseconds
Find one in list of 202921 words: 4234.968662261963 milliseconds
Find none using 'in' in list of 202921 words: 9726.848363876343 milliseconds
# in set
Find none in set of 34809 uniques: 15.897989273071289 milliseconds
Find one in set of 34809 uniques: 11.409759521484375 milliseconds
Find one using 'in' in set of 34809 uniques: 39.183855056762695 milliseconds
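As the comment in the code says, timeit would be the more reliable way to measure this. A minimal sketch of how that could look, assuming a synthetic word list instead of the Faust text (the sizes, the `word…` names and the `measure` helper are made up for illustration, so the absolute numbers will differ from the ones above):

```python
import timeit

# Synthetic stand-in for the real text (assumption: 200,000 words, 30,000 distinct).
words = [f"word{i % 30_000}" for i in range(200_000)]
unique = set(words)
none_true = {"NoWayThatsInIt_1", "NoWayThatsInIt_2"}


def measure(func, number=20):
    """Return the best total time in milliseconds over 3 repeats of `number` calls."""
    return min(timeit.repeat(func, repeat=3, number=number)) * 1000


# Intersecting with the list has to walk every word in it;
# intersecting with the set only needs hash lookups.
t_list = measure(lambda: none_true.intersection(words))
t_set = measure(lambda: none_true.intersection(unique))

print(f"list: {t_list:.3f} ms, set: {t_set:.3f} ms")
```

Taking the minimum of several repeats reduces the influence of other processes on the measurement, which the simple time.time() loop above does not account for.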