
I have a large set of long text documents with punctuation. Three short examples are provided here:

doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]

and I have sets of words like the following:

wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])

I want to search all text documents that meet the following condition:

(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)

The "or not" condition in each parenthesis is needed because the three sets could be empty. Please note that before applying the condition I also need to clean the text from punctuation, transform it to lowercase, and split it into a set of words, which takes additional time.

This process would match the first text in doc but not the second or the third: the second contains the word "dogs", and the third does not include the word "seaside".
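
For reference, the whole baseline looks roughly like this (clean below is just a short stand-in for my much longer cleaning function, and matches is only illustrative):

import re

def clean(text):
    # stand-in for the real cleaning: drop punctuation/hashtags,
    # lowercase, and split into a set of words
    return set(re.findall(r'\w+', text.lower()))

def matches(text):
    words = clean(text)
    return (
        (any(w in words for w in wOR) or not wOR)
        and (all(w in words for w in wAND) or not wAND)
        and (not any(w in words for w in wNOT) or not wNOT)
    )

print([matches(d) for d in doc])  # [True, False, False]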

I am wondering if this general problem (with the words in the wOR, wAND and wNOT sets changing) can be solved in a faster way, avoiding the text pre-processing for cleaning. Maybe with a fast regex solution, perhaps one that uses a Trie(). Is that possible, or do you have any other suggestion?

Wiktor Stribiżew
Forinstance
    Depending on your actual task sizes, it might be faster still to turn the sentences into bags (sets) of words; then the comparison becomes something like `(w & wAND) == wAND and (wOR < w) and (not w & wNOT)`... – AKX May 18 '22 at 11:22
  • Will the words in `wAND`, etc. always consist of letters/digits/underscores? – Wiktor Stribiżew May 18 '22 at 11:24
  • Thanks a lot @AKX, I tried that but the performance improvement was not super. Any other solution faster? Thanks for the suggestion so far! – Forinstance May 18 '22 at 11:25
  • @Wiktor yes, they will be letters, digits and underscores. Not including punctuation or anything stranger. – Forinstance May 18 '22 at 11:27
  • @Forinstance You should probably tell us the approaches you've tried, and your measured speeds for all of them. – AKX May 18 '22 at 11:28
  • Because of `wAND` that should be matched in any order, regex is not going to be a faster option. – Wiktor Stribiżew May 18 '22 at 11:30
  • Thanks @AKX, I did not try any other approach. The performance increase of your solution was something like 5% faster, but I will test it again and be back if I have a more precise value. I did not post it immediately as the other condition was a bit easier to understand for me :) – Forinstance May 18 '22 at 11:31
  • By the way, the code as illustrated does _not_ split the documents to words – if the document contains `nearby`, that `near` would be found in there. Is that on purpose? – AKX May 18 '22 at 11:34
  • @AKX I was intentionally omitting the function to clean and split the docs, as it is a very long function. What I meant is that I clean the text, remove punctuation (including hashtags), transform it to lowercase and split the words into a list, later transformed into a set. In this way I avoid the problem you mention, as "near" should not match "nearby". I made the same observation as a comment to autumnalblake's answer, to be sure this is not the case. – Forinstance May 18 '22 at 11:37
  • I... don't believe it should be a very long function. I think something like `def clean_doc(doc): return re.findall(r'\w+', doc.lower())` should suffice. – AKX May 18 '22 at 11:39
  • Well, it is much longer as I do some additional cleaning and use a complex Trie() function for improving speed, similarly to what is presented here: https://stackoverflow.com/questions/66024709/python-regex-fast-replace-of-multiple-keywords-with-punctuation-and-starting-w . However, for the purpose of this question, we can just consider a solution like the one you mention, which is perfectly fine. I have no speed problem in cleaning (I would like to avoid it if possible, but it is OK to keep it if there is no faster solution). What I was focused on improving here was the condition, which slows things down. – Forinstance May 18 '22 at 11:45

2 Answers


Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the document's words:

def match_words(words, wAND, wOR, wNOT):
    remaining = set(wAND)      # work on a copy so wAND is not consumed across documents
    or_satisfied = not wOR     # an empty wOR is trivially satisfied
    for word in words:         # `words` is the cleaned, split document
        if word in wNOT:
            return False
        if word in remaining:
            remaining.remove(word)
        if not or_satisfied and word in wOR:
            or_satisfied = True
    return or_satisfied and not remaining
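
For example, assuming a clean_doc like the one sketched in the question's comments (lowercased word tokens), this could be used as:

# hypothetical usage with clean_doc from the comments above
results = [match_words(clean_doc(d), wAND, wOR, wNOT) for d in doc]
print(results)  # [True, False, False]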
autumnalblake

You can build regexps for the word bags you have, and use them:

import re

def make_re(word_set):
    # builds a case-insensitive pattern like \b(?:word1|word2|...)\b
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )


wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)
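
For the example sets from the question, the compiled patterns come out roughly as follows (the alternation order may vary, since sets are unordered):

print(wOR_re.pattern)   # \b(?:seaside)\b
print(wAND_re.pattern)  # \b(?:house|near)\b  (or \b(?:near|house)\b)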

def re_match(doc):
    # Note: assumes the three word sets are non-empty (empty sets would
    # need the same "or not" guards as in the question's condition).
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    # count how many distinct wAND words actually occur in the document
    found = set()
    expected = len(wAND)
    for match in wAND_re.finditer(doc):
        found.add(match.group().lower())  # lowercase so "NEAR" and "near" count once
        if len(found) == expected:
            break
    return len(found) == expected
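
On the three documents from the question this gives the same results as the original condition:

print([re_match(d) for d in doc])  # [True, False, False]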

A quick timing test suggests this is about 89% faster than the original (and passes the original "test suite"), likely because:

  • documents don't need to be cleaned (the \b anchors limit matches to whole words and re.I handles case normalization)
  • regexps are run in native code, which tends to be faster than pure Python
name='original'      iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match'      iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match'     iters=10000 time=0.203 iters_per_sec=49363.58

where bag_match is my original comment suggestion of using set intersections:

def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )

If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life, since your documents are likely all unique and caching wouldn't help), then bag_match is much faster:

name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
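
For reference, the cached variant of clean_doc was set up roughly like this (my sketch of the benchmark wiring; as noted, the cache only helps because the same few documents are reused):

from functools import lru_cache
import re

@lru_cache(maxsize=None)
def clean_doc(doc):
    # lowercase + split into word tokens; cached so repeated benchmark
    # iterations over the same documents skip the cleaning step
    return re.findall(r'\w+', doc.lower())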
AKX
  • Thanks so much for all the explanation and tests, this really helps! I did some additional tests and indeed, among your solutions, bag_match seems to be the best. It also has the advantage of allowing me to do more complex cleaning. One last question, as I have difficulties with timing tests (sorry, I am new :) ): do you think the bag_match solution is faster than the one proposed by autumnalblake? – Forinstance May 18 '22 at 13:14
  • bag_match (added my implementation above) seems to be a bit faster than theirs (53769 iter/s vs 48286 iter/s) with this data. – AKX May 18 '22 at 14:46
  • Thanks a lot for testing this further. Very last question: I suppose there is no performance difference between your first implementation of bag_match (the one from your first comment) and the one added to this answer, right? I am now using the solution from your first comment, not this very last one. That seems to work even without the "or not" statements. – Forinstance May 18 '22 at 17:29