2

There is a list of sentences, `sentences = ['Ask the swordsmith', 'He knows everything']`. The goal is to remove those sentences that contain a word from a wordlist `lexicon = ['word', 'every', 'thing']`. This can be achieved using the following list comprehension:

newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]

Note that a plain substring check like `if not word in sentence` is not sufficient, as it would also remove sentences containing words in which a lexicon word is merely embedded, e.g. `word` is embedded in `swordsmith`, and `every` and `thing` are embedded in `everything`.

However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words, and applying the list comprehension above takes hours! Because of that, I'm looking for a faster method to remove strings from a list when they contain words from another list. Any suggestions? Maybe using regex?

hyhno01
    Make `lexicon` a `set` and reverse the search. There's no need to iterate the entire lexicon for each word in the sentence – roganjosh Feb 25 '20 at 18:35
  • could you post an answer with minimal working example? I'll compare the runtime to find the fastest solution – hyhno01 Feb 25 '20 at 18:39
  • 2
    I'm on a phone so it's not very easy for me to do that. I might try but it's a lot of faffing and probably someone will answer before I manage it – roganjosh Feb 25 '20 at 18:40
  • 2
    I'm no Python expert but I can only assume that `sentence.split(' ')` is expensive since it creates an array for every single sentence. Also it would fail if a sentence has any level of punctuation such as a comma or period; so not only is it slow, it is incorrect. You could choose to turn each word into a regex such as `\bword\b` and apply it against each sentence using `re.sub()`. Not sure about performance but at least your logic would behave as prescribed. – MonkeyZeus Feb 25 '20 at 18:42
  • 1
    @roganjosh. I'm on mobile, so didn't see your exchange with OP until after I posted :( – Mad Physicist Feb 25 '20 at 18:45
  • Apparently I was wrong anyways given that one of the answers presents a .2 second solution and it uses `.split()`. Good thing I prefaced with "I'm no Python expert" :-) – MonkeyZeus Feb 25 '20 at 18:55
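
A minimal sketch of the word-boundary regex idea from MonkeyZeus's comment above, assuming the lexicon is combined into one compiled alternation and tested with `pattern.search()` rather than one `re.sub()` call per word (the combined pattern is an illustration, not taken from the thread):

import re

# Illustrative data from the question; the real lists are much larger.
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# \b anchors ensure 'word' does not match inside 'swordsmith';
# re.escape guards against lexicon entries containing regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, lexicon)) + r')\b')

newlist = [s for s in sentences if not pattern.search(s)]
print(newlist)  # both sentences survive: neither contains a whole lexicon word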

2 Answers

3

Do your lookup in a set. This makes it fast, and alleviates the containment issue because you only look for whole words in the lexicon.

lexicon = set(lexicon)
newlist = [s for s in sentences if not any(w in lexicon for w in s.split())]

This is pretty efficient because `w in lexicon` is an O(1) operation, and `any` short-circuits. The main issue is splitting your sentence into words properly. A regular expression is inevitably going to be slower than a customized solution, but may be the best choice, depending on how robust you want to be against punctuation and the like. For example:

import re

lexicon = set(lexicon)
pattern = re.compile(r'\w+')
newlist = [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]
Mad Physicist
  • yeah, I figured that also out after exchange with roganjosh. Let you know the runtime in a sec – hyhno01 Feb 25 '20 at 18:47
  • Thanks for writing it up :) upvoted. I'm not fussed about rep and glad my suggestion was enough to make you reconsider the approach, @hyhno01. It'll be orders of magnitude faster – roganjosh Feb 25 '20 at 18:49
  • runtime for len(lexicon) = 179.918 and len(sentences) = 241.564: 0.2079s. I'll keep the thread open a while to see whether there are faster solutions (although this is already really fast) – hyhno01 Feb 25 '20 at 18:51
  • I've added some notes about how to do regex – Mad Physicist Feb 25 '20 at 18:57
  • I've removed my answer (as you were totally right > inefficient) and studied your answer. Worked a treat. Simple question. I noticed it is case sensitive. Any way to circumvent for future endeavours? Do we have to upper everything for example or is there a text-compare parameter of some sort? – JvdV Feb 25 '20 at 19:09
  • 2
    @JvdV. `lexicon = set(w.lower() for w in lexicon)` and `m.group().lower() in lexicon` or `w.lower() in lexicon`, as necessary. Use `.casefold()` instead of `.lower()` if you want a more aggressive normalization than just lowercasing. – Mad Physicist Feb 25 '20 at 19:13
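
Putting that comment into code, a short sketch of a case-insensitive variant of the regex version above, assuming `.casefold()` normalization as suggested (names follow the answer's code):

import re

lexicon = set(w.casefold() for w in lexicon)   # normalize the lexicon once
pattern = re.compile(r'\w+')

# casefold each matched word before the set lookup so 'Word', 'WORD' and 'word' compare equal
newlist = [s for s in sentences
           if not any(m.group().casefold() in lexicon for m in pattern.finditer(s))]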
1

You can optimize three things here:

1. Convert `lexicon` to a `set` so that the `in` operation becomes cheap (O(1)).

lexicon = set(lexicon)

2. Check the intersection of each sentence with the lexicon in the most efficient way, using set operations. The performance of set intersection was discussed here.

[x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

3. Use `filter` instead of a list comprehension.

list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Final code:

lexicon = set(lexicon)
list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

Results

import re

def removal_0(sentences, lexicon):
    lexicon = set(lexicon)
    pattern = re.compile(r'\w+')
    return [s for s in sentences if not any(m.group() in lexicon for m in pattern.finditer(s))]

def removal_1(sentences, lexicon):
    lexicon = set(lexicon)
    return [x for x in sentences if set(x.split(' ')).isdisjoint(lexicon)]

def removal_2(sentences, lexicon):
    lexicon = set(lexicon)
    return list(filter(lambda x: set(x.split(' ')).isdisjoint(lexicon), sentences))

%timeit removal_0(sentences, lexicon)
%timeit removal_1(sentences, lexicon)
%timeit removal_2(sentences, lexicon)

9.88 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.19 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
2.76 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note: it seems `filter` is a little bit slower, but I don't know the reason yet.

mathfux