I needed to do that in a performance-critical environment, so I benchmarked all the possible variants I could find and think of with Python 3.11. Here are the results:

    import re

    words = ['test', 'èk', 'user_me', '<markup>', '[^1]']

    # 1. Chained `or`
    def find_words(words):
        for word in words:
            if "_" in word or "<" in word or ">" in word or "^" in word:
                pass

    # 2. Plain loop over the substrings
    def find_words_2(words):
        for word in words:
            for elem in [">", "<", "_", "^"]:
                if elem in word:
                    pass

    # 3. Regex with re.search()
    def find_words_3(words):
        for word in words:
            if re.search(r"\_|\<|\>|\^", word):
                pass

    # 4. Regex with re.match(), padded to allow leading characters
    def find_words_4(words):
        for word in words:
            if re.match(r"\S*(\_|\<|\>|\^)\S*", word):
                pass

    # 5. any() with a generator expression
    def find_words_5(words):
        for word in words:
            if any(elem in word for elem in [">", "<", "_", "^"]):
                pass

    # 6. any() with map()
    def find_words_6(words):
        for word in words:
            if any(map(word.__contains__, [">", "<", "_", "^"])):
                pass

> %timeit find_words(words)
351 ns ± 6.24 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit find_words_2(words)
689 ns ± 15.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit find_words_3(words)
2.42 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_4(words)
2.75 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_5(words)
2.65 µs ± 176 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> %timeit find_words_6(words)
1.64 µs ± 28.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
- The naive chained `or` approach wins (function 1).
- The basic iteration over each element to test (function 2) is more than twice as fast as `any()`, and even a regex search is faster than the basic `any()` without `map()`, so I don't get why it exists at all. Not to mention, the plain loop's syntax is purely algorithmic, so any programmer will understand what it does, even without a Python background.
- `re.match()` only looks for the pattern at the beginning of the string (which is confusing if you come from PHP/Perl regex), so to make it work like PHP/Perl, you need to use `re.search()` or tweak the regex to allow characters before the match, which comes with a performance penalty.
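
To illustrate the `re.match()` versus `re.search()` difference, here is a minimal sketch using two of the sample words. The character class `[_<>^]` is my own shorthand for the four characters, not the exact pattern that was benchmarked above.

```python
import re

# Simplified character class for the four characters tested above
# (an illustrative rewrite, not the benchmarked pattern).
SPECIAL = re.compile(r"[_<>^]")

word = "user_me"

# match() only tries at position 0, so it misses the underscore further in.
print(SPECIAL.match(word))            # None
# search() scans the whole string and finds the underscore.
print(SPECIAL.search(word).start())   # 4
# match() does succeed when the string starts with one of the characters.
print(SPECIAL.match("<markup>") is not None)  # True
```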
If the list of substrings to search for is known at programming time, the ugly chained `or` is definitely the way to go. Otherwise, use the basic `for` loop over the list of substrings to search. `any()` and regex are a waste of time in this context.
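
For the case where the substrings are only known at runtime, the basic loop could be wrapped like this. This is a small sketch: `contains_any` and `markers` are hypothetical names, and the extra function call adds overhead that the benchmarks above did not measure.

```python
def contains_any(text: str, substrings: list[str]) -> bool:
    """Return True as soon as one of the substrings is found in text."""
    for substring in substrings:
        if substring in text:
            return True  # stop at the first hit
    return False


# Hypothetical runtime-provided list, reusing the sample data from above.
markers = [">", "<", "_", "^"]
words = ['test', 'èk', 'user_me', '<markup>', '[^1]']

flagged = [w for w in words if contains_any(w, markers)]
print(flagged)  # ['user_me', '<markup>', '[^1]']
```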
For a more down-to-earth application (checking whether a file is an image by looking for its extension in a list):

    def is_image(word: str) -> bool:
        if ".bmp" in word or \
           ".jpg" in word or \
           ".jpeg" in word or \
           ".jpe" in word or \
           ".jp2" in word or \
           ".j2c" in word or \
           ".j2k" in word or \
           ".jpc" in word or \
           ".jpf" in word or \
           ".jpx" in word or \
           ".png" in word or \
           ".ico" in word or \
           ".svg" in word or \
           ".webp" in word or \
           ".heif" in word or \
           ".heic" in word or \
           ".tif" in word or \
           ".tiff" in word or \
           ".hdr" in word or \
           ".exr" in word or \
           ".ppm" in word or \
           ".pfm" in word or \
           ".nef" in word or \
           ".rw2" in word or \
           ".cr2" in word or \
           ".cr3" in word or \
           ".crw" in word or \
           ".dng" in word or \
           ".raf" in word or \
           ".arw" in word or \
           ".srf" in word or \
           ".sr2" in word or \
           ".iiq" in word or \
           ".3fr" in word or \
           ".dcr" in word or \
           ".ari" in word or \
           ".pef" in word or \
           ".x3f" in word or \
           ".erf" in word or \
           ".raw" in word or \
           ".rwz" in word:
            return True
        return False

    IMAGE_PATTERN = re.compile(r"\.(bmp|jpg|jpeg|jpe|jp2|j2c|j2k|jpc|jpf|jpx|png|ico|svg|webp|heif|heic|tif|tiff|hdr|exr|ppm|pfm|nef|rw2|cr2|cr3|crw|dng|raf|arw|srf|sr2|iiq|3fr|dcr|ari|pef|x3f|erf|raw|rwz)")

    extensions = [".bmp", ".jpg", ".jpeg", ".jpe", ".jp2", ".j2c", ".j2k", ".jpc", ".jpf", ".jpx", ".png", ".ico", ".svg", ".webp", ".heif", ".heic", ".tif", ".tiff", ".hdr", ".exr", ".ppm", ".pfm", ".nef", ".rw2", ".cr2", ".cr3", ".crw", ".dng", ".raf", ".arw", ".srf", ".sr2", ".iiq", ".3fr", ".dcr", ".ari", ".pef", ".x3f", ".erf", ".raw", ".rwz"]

(Note that the extensions are declared in the same order in all variants).
> %timeit is_image("DSC_blablabla_001256.nef") # found
536 ns ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit is_image("DSC_blablabla_001256.noop") # not found
923 ns ± 43.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.nef")
221 ns ± 24.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit IMAGE_PATTERN.search("DSC_blablabla_001256.noop") # not found
207 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit any(ext in "DSC_blablabla_001256.nef" for ext in extensions) # found
1.53 µs ± 30.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
> %timeit any(ext in "DSC_blablabla_001256.noop" for ext in extensions) # not found
2.2 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
With a lot more options to test, regexes are actually faster and more legible (for once…) than the chained `or`. `any()` is still the worst.
Empirical tests show that the performance threshold is at 9 elements to test:
- below 9 elements, chained `or` is faster,
- above 9 elements, regex `search()` is faster,
- at exactly 9 elements, both run in around 225 ns.
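
The threshold will likely depend on the machine, the Python version, and the strings being searched, so it may be worth re-measuring. Below is a rough sketch of how the comparison could be scripted with the standard `timeit` module instead of IPython's `%timeit`. The helpers `make_chained_or`, `make_regex`, and `compare` are hypothetical names, and generating the chained `or` through `eval()` is only a benchmarking convenience for trusted literals, not something to use on external input.

```python
import re
import timeit

EXTENSIONS = [
    ".bmp", ".jpg", ".jpeg", ".jpe", ".jp2", ".j2c", ".j2k", ".jpc", ".jpf",
    ".jpx", ".png", ".ico", ".svg", ".webp", ".heif", ".heic", ".tif", ".tiff",
    ".hdr", ".exr", ".ppm", ".pfm", ".nef", ".rw2", ".cr2", ".cr3", ".crw",
    ".dng", ".raf", ".arw", ".srf", ".sr2", ".iiq", ".3fr", ".dcr", ".ari",
    ".pef", ".x3f", ".erf", ".raw", ".rwz",
]


def make_chained_or(substrings):
    """Build a function equivalent to a hand-written chained `or`."""
    source = " or ".join(f"{s!r} in word" for s in substrings)
    # eval() only ever sees the trusted literals defined above.
    return eval(f"lambda word: {source}")


def make_regex(substrings):
    """Compile r"\\.(bmp|jpg|...)"; assumes every entry starts with a dot."""
    body = "|".join(re.escape(s[1:]) for s in substrings)
    return re.compile(r"\.(" + body + ")")


def compare(count, name="DSC_blablabla_001256.noop", number=20_000):
    """Return (chained_or_seconds, regex_seconds) for the first `count` extensions."""
    subset = EXTENSIONS[:count]
    chained = make_chained_or(subset)
    regex = make_regex(subset)
    t_or = timeit.timeit(lambda: chained(name), number=number)
    t_re = timeit.timeit(lambda: regex.search(name), number=number)
    return t_or, t_re


# The default name matches nothing, so every alternative gets tested.
for count in (3, 9, 20, len(EXTENSIONS)):
    t_or, t_re = compare(count)
    print(f"{count:2d} extensions: chained or {t_or:.4f} s, regex {t_re:.4f} s")
```

Calling `compare(n)` for every `n` from 1 to `len(EXTENSIONS)` and looking for the point where the second value drops below the first should locate the crossover on a given setup.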