2

There is a list of sentences, `sentences = ['Ask the swordsmith', 'He knows everything']`. The goal is to remove those sentences that contain a word from a wordlist `lexicon = ['word', 'every', 'thing']`. This can be achieved using the following list comprehension:

newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]

Note that a plain substring check like `if not word in sentence` is not sufficient, as it would also remove sentences containing words in which a lexicon word is merely embedded, e.g. `word` is embedded in `swordsmith`, and `every` and `thing` are embedded in `everything`.

However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words, and applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list when they contain words from another list. Any suggestions? Maybe using regex?

hyhno01
    Make `lexicon` a `set` and reverse the search. There's no need to iterate the entire lexicon for each word in the sentence – roganjosh Feb 25 '20 at 18:35
  • could you post an answer with minimal working example? I'll compare the runtime to find the fastest solution – hyhno01 Feb 25 '20 at 18:39
  • 2
    I'm on a phone so it's not very easy for me to do that. I might try but it's a lot of faffing and probably someone will answer before I manage it – roganjosh Feb 25 '20 at 18:40
  • 2
    I'm no Python expert but I can only assume that `sentence.split(' ')` is expensive since it creates an array for every single sentence. Also it would fail if a sentence has any level of punctuation such as a comma or period; so not only is it slow, it is incorrect. You could choose to turn each word into a regex such as `\bword\b` and apply it against each sentence using `re.sub()`. Not sure about performance but at least your logic would behave as prescribed. – MonkeyZeus Feb 25 '20 at 18:42
  • 1
    @roganjosh. I'm on mobile, so didn't see your exchange with OP until after I posted :( – Mad Physicist Feb 25 '20 at 18:45
  • Apparently I was wrong anyways given that one of the answers presents a .2 second solution and it uses `.split()`. Good thing I prefaced with "I'm no Python expert" :-) – MonkeyZeus Feb 25 '20 at 18:55
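
A minimal sketch of the word-boundary regex idea from MonkeyZeus's comment above, assuming the lexicon is combined into one compiled alternation and tested with `pattern.search()` rather than one `re.sub()` call per word (the combined pattern is an illustration, not taken from the thread):

import re

# Illustrative data from the question; the real lists are much larger.
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# \b anchors ensure 'word' does not match inside 'swordsmith';
# re.escape guards against lexicon entries containing regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, lexicon)) + r')\b')

newlist = [s for s in sentences if not pattern.search(s)]
print(newlist)  # both sentences survive: neither contains a whole lexicon word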

2 Answers

3

Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.

lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]

This is pretty efficient because `w in lexicon` is an O(1) operation, and `any` short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:

import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
Mad Physicist
  • yeah, I figured that also out after exchange with roganjosh. Let you know the runtime in a sec – hyhno01 Feb 25 '20 at 18:47
  • Thanks for writing it up :) upvoted. I'm not fussed about rep and glad my suggestion was enough to make you reconsider the approach, @hyhno01. It'll be orders of magnitude faster – roganjosh Feb 25 '20 at 18:49
  • runtime for len(lexicon) = 179.918 and len(sentences) = 241.564: 0.2079s. I'll keep the thread open a while to see whether there are faster solutions (although this is already really fast) – hyhno01 Feb 25 '20 at 18:51
  • I've added some notes about how to do regex – Mad Physicist Feb 25 '20 at 18:57
  • I've removed my answer (as you were totally right > inefficient) and studied your answer. Worked a treat. Simple question. I noticed it is case sensitive. Any way to circumvent for future endeavours? Do we have to upper everything for example or is there a text-compare parameter of some sort? – JvdV Feb 25 '20 at 19:09
  • 2
    @JvdV. `lexicon = set(w.lower() for w in lexicon)` and `m.group().lower() in lexicon` or `w.lower() in lexicon`, as necessary. Use `.casefold()` instead of `.lower()` if you want a more aggressive normalization than just lowercasing. – Mad Physicist Feb 25 '20 at 19:13
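
Putting that comment into code, a short sketch of a case-insensitive variant of the regex version above, assuming `.casefold()` normalization as suggested (names follow the answer's code):

import re

lexicon = set(w.casefold() for w in lexicon)   # normalize the lexicon once
pattern = re.compile(r'\w+')

# casefold each matched word before the set lookup so 'Word', 'WORD' and 'word' compare equal
newlist = [s for s in sentences
           if not any(m.group().casefold() in lexicon for m in pattern.finditer(s))]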
1

You can optimize three things here:

1. Convert `lexicon` to a `set` so that the `in` operation becomes cheap (O(1)).

lexicon = set(lexicon)

2. Check the intersection of each sentence with the lexicon in the most efficient way, using set operations. The performance of set intersection was discussed here.

[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

3. Use `filter` instead of a list comprehension.

list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Final code:

lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Results

import re

def removal_0(sentences, lexicon):
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)

9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: it seems `filter` is a little bit slower, but I don't know the reason yet.

mathfux