Let me be upfront: I code for fun, and this is a code challenge I've been working on in my spare time for the last couple of days. The challenge is that I'm given a bunch of words separated by spaces (document) and a few search terms in a list. I have to find the spot in the document where those searchTerms are closest together. Basically, find the smallest contiguous slice of the document that contains all of the searchTerms and output that slice. So far, my function seems to work on my system. However, when I upload it, I'm told that my algorithm takes too long to execute.

My thought process was to find every index of each searchTerm in the document and then run itertools.product() against those index lists. I then test each combination to determine which one spans the shortest range of indexes. Here is what I have so far:
def answer(document, searchTerms):
    from itertools import product
    # build a list of the words in the input document
    document = document.split()
    index = []
    # find all indexes for each searchTerm and build a list of lists
    for w in searchTerms:
        index.append([i for i, x in enumerate(document) if x == w])
    # build an iterator over all possible combinations of indexes, one per searchTerm
    combinations = product(*index)
    # recover memory
    del index
    # find the combination with the minimum distance between all search terms
    shortest = min(((max(x) - min(x), x) for x in combinations), key=lambda x: x[0])
    return ' '.join(document[min(shortest[1]):max(shortest[1]) + 1])
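To get a feel for why this blows up, I added a rough debugging helper (combination_count is just a name I picked, not part of the challenge): product() yields one tuple per element of the Cartesian product, so the number of combinations is the product of the occurrence counts of each search term.

from functools import reduce
from operator import mul

def combination_count(index):
    # product(*index) yields len(index[0]) * len(index[1]) * ... tuples,
    # so the search space grows multiplicatively with every searchTerm added
    return reduce(mul, (len(hits) for hits in index), 1)

For three terms that each appear 100 times, that is already 1,000,000 tuples to score; a fourth term at the same frequency pushes it to 100,000,000.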
I tried to use multiprocessing to speed up sections of my code but haven't quite gotten the syntax right. For example:
from multiprocessing import Pool

p = Pool(processes=2)
shortest = p.map(min_max, combinations)

def min_max(combinations):
    return min(((max(x) - min(x), x) for x in combinations))
Results in:
Traceback (most recent call last):
File "./searchTerms2.py", line 65, in <module>
print (answer(document,searchTerms))
File "./searchTerms2.py", line 45, in answer
shortest = p.map(min_max,combinations)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
TypeError: 'int' object is not iterable
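If I'm reading the traceback right, Pool.map() calls min_max() once per combination, so inside min_max() the argument is a single tuple of ints, and max(x) then receives an int instead of an iterable. Moving the final min() outside the pool at least runs for me; a sketch (span is just a name I picked, and the worker function has to live at module level so it can be pickled):

from multiprocessing import Pool

def span(combo):
    # combo is one tuple of indexes, one index per searchTerm
    return (max(combo) - min(combo), combo)

p = Pool(processes=2)
shortest = min(p.map(span, combinations))
p.close()
p.join()

That said, map() still materializes and pickles every combination, so I suspect this just moves the cost around rather than removing it.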
Any pointers would be greatly appreciated. Are there better ways to attack this problem? Are there areas where I could be more efficient?
--EDIT-- Further explanation of the problem:
document = 'this is a song that never ends it goes on and on my friend some people started singing it not knowing what it was and they will continue singing it forever just because this is the song'
searchTerms = ['this', 'goes','on']
should result in:
'this is a song that never ends it goes on'
This works with my current algorithm, but it is nowhere near quick enough when given a much larger document and more searchTerms. I hope this is clearer...
I've been timing my code and it looks like my biggest performance hit is coming from:
shortest = min(((max(x) - min(x), x) for x in combinations), key=lambda x: x[0])
As I increase the number of words in 'document' and add additional searchTerms in 'searchTerms', I see a big performance hit on that line. Everything else varies very little from what I can tell.
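One alternative I've been sketching instead of product(): a single left-to-right sliding window over the document that expands on the right until every searchTerm is covered, then shrinks from the left while it still is. That should be roughly linear in the number of words instead of multiplicative in the occurrence counts. A minimal, untested sketch (answer_window is my own name, and I'm assuming duplicate searchTerms don't matter since I collapse them into a set):

def answer_window(document, searchTerms):
    words = document.split()
    needed = set(searchTerms)
    counts = {}
    missing = len(needed)  # distinct terms not yet inside the window
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            counts[word] = counts.get(word, 0) + 1
            if counts[word] == 1:
                missing -= 1
        # shrink from the left while the window still covers every term
        while missing == 0:
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            w = words[left]
            if w in needed:
                counts[w] -= 1
                if counts[w] == 0:
                    missing += 1
            left += 1
    return ' '.join(words[best[0]:best[1] + 1]) if best else ''

On the example above it returns 'this is a song that never ends it goes on', but I haven't verified it against the grader.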