2

I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution, in Python.

I have a set of words, some of them end with the "*" and some others don't:

words = set(["apple", "cat*", "dog"])

I have to count their total occurrences in a text, considering that anything can go after an asterisk ("cat*" means all the words that start with "cat"). Search has to be case insensitive. Consider this example:

text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS".

I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that "cat*" has ben counted twice, also considering the plural, whereas "apple" has been counted just once, as its plural is not considered (having no asterisk at the end).

I have to repeat this operation on a large set of documents, so I would need a fast solution. I don't know if regex or flashtext could reach a fast solution. Could you help me?

EDIT

I forgot to mention thas some of my words contain punctuation, see here for e.g.:

words = set(["apple", "cat*", "dog", ":)", "I've"])

This seems to create additional problems when compiling the regex. Is there some integration to the code you already provided that would work for these two additional words?

Forinstance
  • 413
  • 4
  • 17
  • Does this answer your question? [Speed up millions of regex replacements in Python 3](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3) – Nick Dec 04 '20 at 10:59
  • Thanks Nick this only partially does, but very useful question. I think the answer is the one you and @Dani gave. I am now trying to understand which appoach would be faster. – Forinstance Dec 04 '20 at 11:52
  • 1
    It will really depend on the size of your `words` list - for a large one the Trie solution will probably be faster, but for something short the cost of building the Trie might outweigh the benefit of speeding up the regex compare. It would be interesting to know what you find for your specific example. – Nick Dec 05 '20 at 00:12

3 Answers3

4

You can do this with regex, creating a regex out of the set of words, putting word boundaries around the words but leaving the trailing word boundary off words that end with *. Compiling the regex should help performance:

import re

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

regex = re.compile('|'.join([r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b' for w in words]), re.I)
matches = regex.findall(text)
print(len(matches))

Output:

4
Nick
  • 138,499
  • 22
  • 57
  • 95
1

DISCLAIMER: I'm the author of trrex

For this problem, if you really want a scalable solution, use a trie regex instead of a union regex. See this answer for an explanation. One approach is to use trrex, for example:

import trrex as tx
import re

words = {"apple", "cat*", "dog"}
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

prefix_set = {w.replace('*', '') for w in words if w.endswith('*')}
full_set = {w for w in words if not w.endswith('*')}

prefix_pattern = re.compile(tx.make(prefix_set, right=''), re.IGNORECASE)  # '' as we only care about prefixes
full_pattern = re.compile(tx.make(full_set), re.IGNORECASE)

res = prefix_pattern.findall(text) + full_pattern.findall(text)
print(res)

Output

['cat', 'CAT', 'apple', 'dog']

For a particular use of trrex, see this, the experiments described over there yield a 10 times improvement over the naive union regex. A trie regex take advantages of common prefixes and creates an optimal regular expression, for the words:

['baby', 'bat', 'bad']

it creates the following:

ba(?:by|[td])
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • Thank a lot for this solution. I also very much liked the one from @Nick. Is your new code going to be faster than NIck's one? How faster? Thanks again! – Forinstance Dec 04 '20 at 10:03
0

Create a a Trie for the words you want to search.

Then iterate over the characters of the string you want to check.

  • Each time you reached a leaf in the tree, increase the counter and skip to the next word.
  • Each time there is no path, skip to the next word.
napuzba
  • 6,033
  • 3
  • 21
  • 32