
I have a list of word combinations like "brown fox" and a bunch of sentences to check. I just want to find how many times the elements from the list occur in each sentence.

I have a working solution, but I want to make it faster, so I'd like an opinion or any alternative way to do things.

Nothing is case sensitive.

The solution I have works well when my list of keywords is small. What if my list of keywords has 80 elements and my sentence is only two or three words? It will be slow. Is there any way to improve the solution?

harry_line = "The Dark Lord Voldemort is 
    shooting another shooter who claimed to be Dark Lord."
keywords = ['Dark Lord', 'shooter', 'plan', 'poncho', 'brown fox', 'ugly cake piece']

print(sum(harry_line.count(phrase) for phrase in keywords))

In the above example, "Dark Lord" occurs twice and "shooter" once, giving an output of 3, which is correct.

rafaelc
Ryan Cyrus
  • What exactly is the problem? You're just trying to make it more efficient? – jhpratt Aug 02 '18 at 00:42
  • Yes, I just want it to be more efficient, or to find any other way to solve it. – Ryan Cyrus Aug 02 '18 at 00:43
  • `pastry_line` and `short_line` are not used? – rafaelc Aug 02 '18 at 00:44
  • @RafaelC Those lines are just for demonstration; they show that the length of the sentences may vary. – Ryan Cyrus Aug 02 '18 at 00:45
  • You could make a count map of values from the sentence, and then sum the keys in your array. That would make your processing time closer to O(N) – David Culbreth Aug 02 '18 at 00:54
  • @DavidCulbreth How would that work? Can you please explain? – Ryan Cyrus Aug 02 '18 at 00:57
  • 1
    If your sentence is only two or three words, scanning for 80 elements is still pretty trivial work. If you need to scan thousands or millions of sentences, with thousands of phrases to check for, [it can be worth optimizing](https://stackoverflow.com/a/34820772/364696), but `O(m*n)` is still pretty trivial when `m` and `n` are three digits or smaller, with minimal constant cost factors. – ShadowRanger Aug 02 '18 at 00:59
  • @ShadowRanger Well, for now I'm not checking that many sentences, but thanks for pointing out the answer. It is really helpful. Does that mean I can stick with the current solution without getting looked down on by experts? – Ryan Cyrus Aug 02 '18 at 01:11
  • @ShadowRanger raises a good point: that if performance isn't a problem, then don't pre-optimize. However, assuming your data set is large enough, then parallelizing this would also be a good next step. – David Culbreth Aug 02 '18 at 01:12
  • @DavidCulbreth Performance can be an issue later. How would I pre-optimize then? If I have 50K sentences and 120 keywords, how should I approach the problem? – Ryan Cyrus Aug 02 '18 at 01:16
  • 2
    @RyanCyrus: The scenario with 50K sentences and 120 keywords is around the level at which Aho-Corasick (described in [my linked answer](https://stackoverflow.com/a/34820772/364696)) becomes likely to help, reducing 50K * 120 = 6M scans down to 50K scans at the small memory and precompute cost of building the Aho-Corasick automaton. But again, only do this is you think there is a high probability of actually encountering that sort of input size; taking external dependencies and complicating your code, even a little, isn't worth it unless testing shows simple code is too slow on *real* data. – ShadowRanger Aug 02 '18 at 01:22
  • 1
    Look up threading. Split the sentence into however many threads you have, and do the same search you have in your question in each thread, on each respective portion of the string. Then add all the partial sums. Really just read up on multiprocessing. The Python Multiprocessing library will allow you to utilize all of the physical cores on your machine. Edit: I'm on mobile right now. I'll write up an answer when I get to my laptop, if no one steals the idea first. – David Culbreth Aug 02 '18 at 01:22
  • 2
    As a reminder, in that linked answer's question, the OP was performing *billions* of scans in about 40 minutes; performing single digit millions of scans would take single digit *seconds*, so even for 50K sentences/120 keywords, your savings only matter if waiting for, say, five seconds is unacceptable. Aho-Corasick could probably drop that five seconds to a small fraction of a second, but will anyone notice or care? If the answer is "No", keep it simple. – ShadowRanger Aug 02 '18 at 01:27
  • @ShadowRanger Thank you so much for explaining. It helps a lot. I wish I could mark this discussion as an answer. – Ryan Cyrus Aug 02 '18 at 01:44
  • @DavidCulbreth I will definitely think about multiprocessing if things get out of hand. Thank you for the explanation. – Ryan Cyrus Aug 02 '18 at 01:45
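
A minimal sketch of the multiprocessing idea suggested in the comments above, assuming a corpus of many sentences: the work is spread across whole sentences rather than pieces of one sentence, so a multi-word phrase is never cut at a chunk boundary. The helper name `count_keywords` and the stand-in corpus are illustrative only, not part of the original discussion.

from multiprocessing import Pool

KEYWORDS = ['dark lord', 'shooter', 'plan', 'poncho', 'brown fox', 'ugly cake piece']

def count_keywords(sentence):
    # Same naive scan as the one-liner in the question, for a single sentence.
    line = sentence.lower()
    return sum(line.count(phrase) for phrase in KEYWORDS)

if __name__ == '__main__':
    # Stand-in corpus; in practice this would be the real 50K sentences.
    sentences = ["The Dark Lord Voldemort is shooting another shooter "
                 "who claimed to be Dark Lord."] * 50000
    with Pool() as pool:
        total = sum(pool.map(count_keywords, sentences, chunksize=1000))
    print(total)  # 150000 for this artificial corpus (3 matches per copy)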

2 Answers


Because the OP would like a real answer, here is a simple list of possibilities, in the order in which they should be tried:

  1. Use the naive solution
  2. No, seriously, use the naive solution; you don't have nearly enough needles and haystacks to make any optimization worthwhile. Haystacks and needles in the two digit range can be scanned on a 20 year old graphing calculator faster than you can blink; on any reasonably modern hardware, you should be able to search thousands of haystacks for hundreds of needles faster than even the most impatient human would notice.
  3. Really, you sure? In one example case, single digit billions of scans performed the naive way took about 40 minutes; if you're doing fewer than millions of scans the naive way, you're in the low single digit seconds cost range. Try the naive solution, find the biggest realistic set of inputs you're likely to encounter, and figure out how long it takes. Is it long enough to matter? No? Use the naive solution.
  4. Sigh... Alright, you tried naive, and it was too slow. Perhaps consider Aho-Corasick? It will reduce one scan per "needle" per "haystack" down to a precompute step to make an Aho-Corasick automaton, followed by a single scan per "haystack" no matter how many needles are being searched for. If that's still not enough, consider using the multiprocessing module to parallelize the Aho-Corasick scans.
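
A minimal sketch of step 4, assuming the third-party pyahocorasick package (pip install pyahocorasick); the keyword list and sentence are reused from the question, and everything is lowercased since nothing is case sensitive. Note that automaton.iter() reports every occurrence, including overlapping ones, so its counts can differ slightly from str.count() when keywords overlap each other.

import ahocorasick

keywords = ['Dark Lord', 'shooter', 'plan', 'poncho', 'brown fox', 'ugly cake piece']
sentences = ["The Dark Lord Voldemort is shooting another shooter "
             "who claimed to be Dark Lord."]

# Precompute: build the automaton once, no matter how many haystacks follow.
automaton = ahocorasick.Automaton()
for phrase in keywords:
    automaton.add_word(phrase.lower(), phrase)
automaton.make_automaton()

# Scan: a single pass over each sentence, regardless of the number of keywords.
total = sum(
    sum(1 for _ in automaton.iter(sentence.lower()))
    for sentence in sentences
)
print(total)  # 3 for the example sentence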
ShadowRanger

If you're looking for speed, you might try making a count dict of the words, then summing the counts of the words in your list/tuple.

search_words = ['enter', 'your', 'search']
sentence = "enter your sentence here"

# Count each word in the sentence once.
counts = dict()
for word in sentence.split():
    counts[word] = counts.get(word, 0) + 1

# Sum the counts of the words we care about; dict lookups are near-constant time.
total = 0
for word in search_words:
    if word in counts:
        total += counts[word]
print(total)

This method is roughly O(n + m) (n words in the sentence, m search words) instead of the O(m×n) that your nice little one-liner does, because it exploits the near-constant lookup time of the dict type.

David Culbreth
  • You are using single words as list elements. However, in my case I can have "enter your" as a single keyword. Your code won't work in that situation, or will it? – Ryan Cyrus Aug 02 '18 at 01:08
  • I realize that this specific example won't be able to find things like "Dark Lord", but something like this may be able to parse large files faster than the admittedly very nice one-liner that was already in the question. – David Culbreth Aug 02 '18 at 01:09
  • Exactly, I agree with you. For single words your code is perfect. I wrote something pretty similar at first, but then this specific example came up, which forced me to think of something different. – Ryan Cyrus Aug 02 '18 at 01:13
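
For completeness, one way to extend the count-dict idea to multi-word keywords is to count word n-grams of the sentence up to the length of the longest phrase. This is only a sketch: it matches on whitespace-split word boundaries and does no punctuation stripping, so its results can differ from str.count() (a trailing period would stop "Dark Lord." from matching, for example). The function name phrase_counts and the sample inputs are made up for the illustration.

from collections import Counter

def phrase_counts(sentence, phrases):
    # Count every word n-gram up to the longest phrase length...
    words = sentence.lower().split()
    max_len = max(len(p.split()) for p in phrases)
    grams = Counter(
        ' '.join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    )
    # ...then each phrase lookup is a near-constant-time dict hit.
    return sum(grams[p.lower()] for p in phrases)

print(phrase_counts("enter your sentence here, enter your name",
                    ['enter your', 'sentence']))  # 3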