7

I was asked to find the total number of occurrences of a substring (case-insensitive, with or without surrounding punctuation) in a given string. Some examples:

count_occurrences("Text with", "This is an example text with more than +100 lines") # Should return 1
count_occurrences("'example text'", "This is an 'example text' with more than +100 lines") # Should return 1
count_occurrences("more than", "This is an example 'text' with (more than) +100 lines") # Should return 1
count_occurrences("clock", "its 3o'clock in the morning") # Should return 0

I chose regex over .count() as I needed an exact match, and ended up with:

import re

def count_occurrences(word, text):
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text, re.IGNORECASE))

This counts every match correctly, but my code takes 0.10 secs while the expected time is 0.025 secs. Am I missing something? Is there a better (performance-optimised) way to do this?

  • What is an extra match you need? Only case insensitivity? – ghchoi Dec 23 '20 at 07:46
  • I already have every match; the question is whether there is a better way to do it, one that can bring the execution time down to the expected 0.025 secs –  Dec 23 '20 at 07:51
  • Regex is usually much more than what one needs. If you choose regex only for case insensitive matching, `text.lower().count(word.lower())` is much faster. Do you need another regex? Or, you could find messy but more specifically optimized code. – ghchoi Dec 23 '20 at 07:59
  • See my examples above; it's mixed (case, punctuation, brackets, etc.). If I go with `.count`, say `txt = "texts texts texts"`, count will return 3 if I search for `text`, and I don't want that (it needs to return a match only for the exact word) –  Dec 23 '20 at 08:06
  • Are you open to solutions that are not based on regex and will be (probably) much faster? – Tomer Shetah Dec 27 '20 at 23:12
  • Yes, sure, as long as I hit my target performance. –  Dec 28 '20 at 07:17
  • From your examples I gather that "exact match" for you includes if there are `()` characters around the string but not if there is a `'` before the string. That's not very extensive. What about double quotes? Block brackets like `[]`? etc. Seeing as you're mostly looking at performance, I can see that the look-behind (that's the `(?<!...)` part) is rather expensive and you could look at using the `(?:s1|s2|s3)(*SKIP)(*F)|whatYouWant` methods described here: https://stackoverflow.com/a/23589204/2684660 – asontu Dec 29 '20 at 22:26
  • Try non-greedy pattern matching – Golden Lion Oct 23 '21 at 14:38
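The `.count` pitfall raised in the comments can be reproduced directly; `str.count` matches substrings, not whole words:

```python
# str.count finds "text" inside every "texts", so it over-counts
txt = "texts texts texts"
print(txt.lower().count("text"))  # 3, even though "text" never appears as a word
```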

5 Answers

3

OK, I was struggling to make it work without regexes, as we all know that regexes are slow. Here is what I came up with:

def count_occurrences(word, text):
    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
    s = text.lower().split(word.lower())
    l = len(s)
    return sum((
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings))) 
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))

Running the whole test file on ideone gives:

Ran 1 test in 0.025s

OK

Which is what the question is asking for.

The logic is pretty simple. Split the text by the word, both lower-cased. Now look at each pair of neighbouring pieces: if, for example, piece 0 ends with a valid delimiter and piece 1 starts with a valid delimiter, count that as an occurrence. Do that up to the last pair of the split.
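A quick trace of that split on the question's third example:

```python
text = "This is an example 'text' with (more than) +100 lines"
word = "more than"
s = text.lower().split(word.lower())
print(s)  # ["this is an example 'text' with (", ') +100 lines']
# The left piece ends with '(' (a valid "space") and the right piece
# starts with ')' (a valid "ending"), so this pair counts as one occurrence.
```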

Since performance is important here, we have to be aware of the order of `spaces` and `endings`. We are basically looking for the first item in the list that fulfils the condition, so it is important to place the most common delimiters first. For example, if I declare:

spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

instead of what I have in my solution, I get a run of 0.036 seconds.

If for example I declare one array:

spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

which has all the delimiters, and use only that, I get 0.053 seconds, which is more than double my solution's 0.025 seconds.

There may well be an even better ordering of the delimiters than the one above.

Tomer Shetah
  • This looks like some effort and performs well. Let's see if it satisfies OP! Plus from me. However, I don't generally agree that *regexes are slow* :) It depends. – bobble bubble Dec 30 '20 at 21:51
  • Wow, it's fast! I thought regex was faster; you proved me wrong :). I'll accept this as the answer as it fulfils my performance requirement. Thank you!! –  Dec 31 '20 at 07:59
1

If the set of words you are searching for is predefined and finite, regex pre-compilation through `re.compile` can make things faster. Something like:

import re

search_words = [
  'foo',
  'bar',
  'baz',
]

words_to_re = {w: re.compile(f"(?<![a-z])((?<!')|(?<='')){w}(?![a-z])((?!')|(?=''))") for w in search_words}

def count_occurrences(word, text):
    regex = words_to_re[word]
    return len(regex.findall(text))
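Note the snippet above drops the `re.IGNORECASE` flag from the question's pattern. A self-contained sketch with the flag restored (the `search_words` values are placeholders):

```python
import re

search_words = ['foo', 'bar']

# Compile each pattern once, up front
words_to_re = {
    w: re.compile(f"(?<![a-z])((?<!')|(?<='')){w}(?![a-z])((?!')|(?=''))",
                  re.IGNORECASE)
    for w in search_words
}

def count_occurrences(word, text):
    # findall returns one tuple per match because the pattern has groups,
    # so len() still equals the number of matches
    return len(words_to_re[word].findall(text))

print(count_occurrences("foo", "Foo bar foo"))  # 2
```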
ilov3
  • `re.compile(f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))", re.IGNORECASE)` tried but the performance is bad. it took +1 sec –  Dec 23 '20 at 08:18
  • You have to precompile searching patterns once and then reuse them on your function. – ilov3 Dec 23 '20 at 08:22
  • No, my text is not predefined; it comes through the function parameters, `count_occurences("different word", "different text to compare +100 line")`, every time –  Dec 23 '20 at 08:37
  • Precompiling was more important some time ago. Last seen, the `re` module uses an LRU cache for recent regexes, so compilation happens only once anyway. – Charles Merriam Dec 29 '20 at 22:50
0

You can lower-case both strings manually using the `str.lower()` method. Check this; maybe it will help you:

import re

def count_occurrences2(word, text):
    text = text.lower()
    word = word.lower()
    pattern = f"(?<![a-z])((?<!')|(?<='')){word}(?![a-z])((?!')|(?=''))"
    return len(re.findall(pattern, text))

I measured execution time using the `timeit` library:

import timeit

def checkTime(word, text, function):
  now = timeit.default_timer()
  function(word, text)
  return timeit.default_timer() - now

text = "This is an example 'text' with (more than) +100 lines " * 1000
word = "more than"
time_0 = checkTime(word, text, count_occurrences)
time_1 = checkTime(word, text, count_occurrences2)
print(time_0)
print(time_1)
print(time_1 < time_0)  # True

Edit:

This is another way:

def count_occurences_in_text(word, text):
    pattern = r"(?<![a-z])((?<!')|(?<=''))" + word.lower() + r"(?![a-z])((?!')|(?=''))"
    line_now = text.lower()
    count = 0
    search = re.search(pattern, line_now)
    while search:
        count +=1
        line_now = line_now[search.span()[1]:]
        search = re.search(pattern,line_now)
    return count
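The loop above re-slices the string after each match; `re.finditer` (a standard alternative, not from the original answer) does the same scan in a single pass:

```python
import re

def count_occurrences_finditer(word, text):
    # Same pattern as the answer's, but let the regex engine advance
    # through the string instead of slicing it after every match.
    pattern = (r"(?<![a-z])((?<!')|(?<=''))" + re.escape(word.lower())
               + r"(?![a-z])((?!')|(?=''))")
    return sum(1 for _ in re.finditer(pattern, text.lower()))

print(count_occurrences_finditer(
    "more than", "This is an example 'text' with (more than) +100 lines"))  # 1
print(count_occurrences_finditer("clock", "its 3o'clock in the morning"))  # 0
```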

Edit 2:

This function passes all the assertions in your code (considering execution time):

def count_occurences_in_text(word, text):
    text = text.lower()
    word = word.lower()
    word_len = len(word)
    text_len = len(text)
    if not (word[0] >= 'a' and word[0] <= 'z') :
        word = word[1:word_len]
        if not (word[len(word) - 1] >= 'a' and word[len(word) - 1] <= 'z') :
            word = word[1:len(word)-1]
    count = 0
    index = 0
    have = [' ', ",","!","?",".","\n",":"]
    haveP = [' ',':']
    if word_len > text_len:
        return 0
    while index < text_len-word_len+1:
        if text[index:index+word_len] == word:
            if index != 0:
                prev_word = text[index-1]
                # if (prev_word >= 'a' and prev_word <= 'z') or prev_word == "'":
                if prev_word not in haveP:
                    if index > 1 and text[index-1] =="'" and text[index-2]=="'":
                        count+=1
                        index += word_len+1
                        continue
                    if index > 1 and text[index-1] =="_" and text[index-2]=="_":
                        count+=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            if index + word_len <= text_len-1:
                last_word = text[index+word_len]
                # if (last_word >= 'a' and last_word <= 'z') or last_word == "'":
                if last_word not in have:
                    if index+word_len <= text_len-2 and last_word =="'" and text[index+word_len+1]=="'":
                        count +=1
                        index += word_len+1
                        continue
                    if index+word_len <= text_len-2 and last_word =="_" and text[index+word_len+1]=="_":
                        count +=1
                        index += word_len+1
                        continue
                    else:
                        index += 1
                        continue
            count += 1
            index += word_len+1
        index += 1
    return count
MIARD
  • It helps but not entirely; have you seen the example file (link) in my question? –  Dec 25 '20 at 16:27
  • Yes. Here: [follow this link](https://ideone.com/Z8sTxH), I have tried another way. – MIARD Dec 25 '20 at 21:26
  • Your first method performs better than the second, but my execution time `0.1728` is still not at the expected time though –  Dec 26 '20 at 07:32
  • I have tried another approach and successfully passed all assertions, but the execution time does not meet your expectation. Still, you can check it here: [follow this link](https://ideone.com/X37RJH) – MIARD Dec 26 '20 at 14:14
0

Use a regular expression split:

import re

def count_occurrences(search_word, text):
    alist = re.split(r'\s+', text)
    matches = [word for word in alist if word == search_word]
    return len(matches)

count_occurrences("clock", "its 3o'clock in the morning")

output

0
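Note that this splits on whitespace only, so punctuation stays attached to the tokens, and matches like the question's bracketed example are missed:

```python
import re

text = "This is an example 'text' with (more than) +100 lines"
tokens = re.split(r'\s+', text)
print(tokens)
# Punctuation stays glued to the tokens: the list contains "(more" and "than)",
# so an exact comparison against "more" (let alone the phrase "more than")
# never succeeds.
```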
Golden Lion
-1

The first error was using the words "Python" and "Performance" in the same sentence. Python is heavily oriented in the direction of "cheap--develop code quickly" with "good--runs as expected" as doable. Fast is right out. Any advice here is strictly implementation-dependent.

  1. You can clean up your regular expressions. I recommend using the `\b` (word boundary) shorthand. In all cases, I strongly recommend playing with your regex interactively on regex101 or equivalent.

  2. You can write your own search function in Python. It will be slower by running in Python and faster by skipping storage of matches and other generality.

  3. You can compare your speeds against a simple string `.count()`. You will need to use `lower()` and decide whether the word "this" matches "sthisany" or not.

  4. You can modify your test function to actually have a hundred lines, e.g., text = (text + '\n')*100.

  5. You can use PyPy, which will generally speed up your execution by trading away some start-up time and some meta-programming flexibility.

  6. You can write a small snippet of C code and learn to call it from Python. Fun on its own.
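Point 1's `\b` suggestion can be sketched as follows. One caveat: `\b` treats the apostrophe as a word boundary, so the question's `3o'clock` example would still count as a match and needs extra handling:

```python
import re

def count_word(word, text):
    # \b asserts a word boundary; re.escape guards any regex metacharacters
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, re.IGNORECASE))

print(count_word("more than",
      "This is an example 'text' with (more than) +100 lines"))  # 1
print(count_word("clock", "its 3o'clock in the morning"))  # 1, not the desired 0
```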

I recommend you keep notes and comparisons and turn those in with your homework, not just the final product.

Lyndon Gingerich
Charles Merriam