I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not
condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.
This process would match the first text in doc
but not the second and the third. Indeed, the second would not match as it contains the word "dogs" and the third because it does not include the word "seaside".
I am wondering if this general problem (with words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding text pre-processing for cleaning. Maybe with a fast regex solution, that perhaps uses Trie(). Is that possible? or any other suggestion?