I need to make this function run much faster (~20x faster) to meet a required benchmark. I have made quite a few improvements from my initial implementation, but am hitting a wall.
The basic problem is this: count case-insensitive occurrences of word
in text
.
Complicating criteria include:
- Must be a whole word (
word
"George" is not found intext
"Georges") - Single quotes are to be considered part of words, unless there are more than one in a row
word
may actually be a phrase (meaning it could have spaces, punctuation, etc.)- Cannot use regex
My basic implementation has been to loop through each character in text
, maintaining my position in word
and if the character matches the corresponding character of word
, I add it to a local string, advance my position in word
and text
, and go again. Once I have a match candidate (i.e. my local string is equal to word
), I check the surrounding characters to ensure the match candidate is a full word, per rules 1 and 2 above. Note this checking does not occur frequently enough to materially impact the total time the algorithm takes.
Most successful optimizations I have made so far:
- Do string lowercasing and length measuring outside the loop
- Check that
word
is at least a substring oftext
otherwise return 0 immediately - Don't bother checking for full word potential until we have a full match
- Count the number of occurrences up front (with no rules), and break out of the loop immediately if we meet that number
I've profiled the code line-by-line using pprofile, and the majority of my code's runtime is simple lines like incrementing the counter var, resetting the match_candidate
string to "", indexing into strings, and if statements. I have not included the code for validate_full_match
as it is not a significant time user.
Are there any low-hanging fruit I'm ignoring? A different approach entirely I should consider?
Thanks for any suggestions!
def count_occurences_in_text(word, text):
"""Number of occurences of word (case insensitive) in text
Note that word can actually be any length of text, from a single
character to a complete phrase; however, partial words do not
count. For example:
count_occurences_in_text("george", "I am Georges") returns 0
while
count_occurences_in_text("i am", "I am Georges") returns 1
"""
# We perform some measurements and manipulation at the start to
# avoid performing them repeatedly in the loop below
text = text.lower()
word = word.lower()
max_matches = text.count(word)
if max_matches == 0:
return 0
word_len = len(word)
# Counter vars
match_count = 0
text_cursor = 0
word_cursor = 0
# We will build up match_candidate and check it against word
match_candidate = ""
for text_char in text:
if text_char == word[word_cursor]:
match_candidate += text_char
if word == match_candidate:
if validate_full_match(text, text_cursor, word_len):
match_count += 1
if match_count == max_matches:
break
word_cursor = 0
match_candidate = ""
else:
word_cursor += 1
else:
match_candidate = ""
word_cursor = 0
text_cursor += 1
return match_count