I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution, in Python.
I have a set of words, some of which end with an asterisk ("*") and some of which don't:
words = set(["apple", "cat*", "dog"])
I have to count their total occurrences in a text, considering that anything can follow an asterisk ("cat*" means all the words that start with "cat"). The search has to be case-insensitive. Consider this example:
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that "cat*" has been counted twice, since its plural also matches, whereas "apple" has been counted just once, as its plural is not considered (it has no asterisk at the end).
I have to repeat this operation on a large set of documents, so I need a fast solution. I don't know whether regex or flashtext would be fast enough. Could you help me?
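To make the matching rule concrete, here is a minimal sketch using the standard `re` module. It assumes that a trailing "*" means "any run of word characters may follow" and that every entry starts and ends with a word character. The helper name `build_pattern` is made up for illustration, and I have not measured how this compares to flashtext.

```python
import re

words = {"apple", "cat*", "dog"}
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

def build_pattern(words):
    """Hypothetical helper: turn the word set into one compiled alternation."""
    parts = []
    for w in sorted(words, key=len, reverse=True):
        if w.endswith("*"):
            # "cat*" -> "cat" followed by any number of word characters
            parts.append(re.escape(w[:-1]) + r"\w*")
        else:
            parts.append(re.escape(w))
    # \b on both sides so that "apples" does not count as "apple"
    return re.compile(r"\b(?:" + "|".join(parts) + r")\b", re.IGNORECASE)

pattern = build_pattern(words)   # compile once, reuse for every document
score = len(pattern.findall(text))
print(score)  # 4
```

The pattern is built and compiled once, so only the `findall` call would need to run per document.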
EDIT
I forgot to mention that some of my words contain punctuation, for example:
words = set(["apple", "cat*", "dog", ":)", "I've"])
This seems to create additional problems when compiling the regex. Is there a way to extend the code you already provided so that it also handles these two additional words?
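As far as I can tell, the problem is that `\b` only matches between a word character and a non-word character, so it never matches next to ":)" when the emoticon is surrounded by spaces. One possible workaround, sketched below, is to escape every entry with `re.escape` and add `\b` only on the sides where the entry actually starts or ends with a word character. The names `to_regex` and `count_matches` are made up for illustration, the sample sentence is my own, and the sketch assumes no entry is a bare "*".

```python
import re

words = {"apple", "cat*", "dog", ":)", "I've"}

def to_regex(word):
    """Hypothetical helper: regex fragment for one entry of the word set."""
    is_prefix = word.endswith("*")
    core = word[:-1] if is_prefix else word
    # Add a word boundary only where the entry begins/ends with a word character
    left = r"\b" if re.match(r"\w", core) else ""
    if is_prefix:
        right = r"\w*"  # anything made of word characters may follow
    else:
        right = r"\b" if re.search(r"\w$", core) else ""
    return left + re.escape(core) + right

# Longest entries first, so a longer entry wins over a shorter overlapping one
pattern = re.compile(
    "|".join(to_regex(w) for w in sorted(words, key=len, reverse=True)),
    re.IGNORECASE,
)

def count_matches(text):
    """Total number of non-overlapping matches in one document."""
    return sum(1 for _ in pattern.finditer(text))

sample = "I've seen my cat :) and my CATS love apples, not an apple or a dog :)"
print(count_matches(sample))  # 7
```

Because of `re.IGNORECASE`, "I've" would also match "i've"; if that is not wanted, the case-sensitive entries would need a separate pattern.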