
Let me be upfront: I code for fun, and this is a code challenge I've been working on in my spare time over the last couple of days. The challenge is that I am given a bunch of words separated by spaces (the document) and a few search terms in a list. I have to find the spot in the document where those searchTerms are closest together. Basically, find the smallest subset of the document that contains all of the searchTerms and output that subset. So far, my function seems to be working on my system; however, when I upload it, I am told that my algorithm takes too long to execute. My thought process was to find every instance of a searchTerm in the document and then run itertools.product() against those index lists. I then test each combination to determine which one is the shortest based on index values. Here is what I have so far:

def answer(document, searchTerms):
    from itertools import product

    #build a list of the input document
    document = document.split()

    index = []
    #find all indexes for the searchTerms and build a list of lists 
    for w in searchTerms:
        index.append([i for i,x in enumerate(document) if x == w])

    #build iterator of all possible combinations of indexes for each of the searchTerms
    combinations = product(*index)

    #recover memory
    del index

    #build tuple of minimum distance between all search terms
    shortest  = min(((max(x) - min(x),x) for x in combinations),key=lambda x: x[0])

    return (' '.join(document[min(shortest[1]):max(shortest[1])+1]))

I tried to use multiprocessing to speed up sections of my code but haven't quite gotten the syntax right. For example:

from multiprocessing import Pool
p = Pool(processes=2)
shortest = p.map(min_max,combinations)

def min_max(combinations):
    return min(((max(x) - min(x),x) for x in combinations))

Results in:

Traceback (most recent call last):
File "./searchTerms2.py", line 65, in <module>
print (answer(document,searchTerms))
File "./searchTerms2.py", line 45, in answer
shortest = p.map(min_max,combinations)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
TypeError: 'int' object is not iterable

Any pointers would be greatly appreciated. Are there better ways to attack this problem? Are there areas where I could be more efficient?

--EDIT-- Further explanation of the problem:

document = 'this is a song that never ends it goes on and on my friend some people started singing it not knowing what it was and they will continue singing it forever just because this is the song'

searchTerms = ['this', 'goes','on']

should result in:

'this is a song that never ends it goes on'

This works with my current algorithm, but it isn't nearly quick enough when given a much larger document and more searchTerms. I hope this is clearer...

I've been timing my code and it looks like my biggest performance hit is coming from:

shortest  = min(((max(x) - min(x),x) for x in combinations),key=lambda x: x[0])

As I increase the number of words in 'document' and add additional searchTerms in 'searchTerms', I see a big performance hit on that line. Everything else varies very little from what I can tell.

Ron Wellman
  • If you are aiming for speed, any line with `#recover memory` should probably not be there unless you are dealing with multi-GB arrays that might send you into virtual memory. – Mad Physicist Mar 07 '16 at 12:56
  • Rather than doing `index.append([i for i,x in enumerate(document) if x == w])` for each search term, do `index.append([i for i,x in enumerate(document) if x in searchTerms])`. Also, turn `searchTerms` into a `set` for faster lookup. – Mad Physicist Mar 07 '16 at 12:58
  • I will turn this into an answer since that is probably what is tripping you up. – Mad Physicist Mar 07 '16 at 13:01
  • Regarding multiprocessing: `p.map` applies `min_max` to each element of `combinations`. It seems you are expecting the iterator itself; see the sketch after these comments. – Roland W Mar 07 '16 at 13:17
  • @Mad Physicist `index.append([i for i,x in enumerate(document) if x in searchTerms])` results in one list of index values that does not differentiate between which searchTerm produced that index, as opposed to the same searchTerm being found in multiple places. This means the distance won't zero me in on the spot in the document that contains all of the searchTerms as close together as possible. Am I missing something here? – Ron Wellman Mar 07 '16 at 14:29
  • The question you asked did not make that clear at all. Imagine reading your question from the point of view of someone who has never looked at the original problem. If you have to quote the text of the challenge, that would be perfectly fine. – Mad Physicist Mar 07 '16 at 14:35
  • OK. So let me know if this makes sense: You are looking for the shortest segment that contains all search terms. – Mad Physicist Mar 07 '16 at 14:51
  • @MadPhysicist Yes. I apologize for not making that clearer. – Ron Wellman Mar 07 '16 at 14:54
  • @RonWellman. Thanks for the clarification. I think you should add that to your question. I will try to come up with something. – Mad Physicist Mar 07 '16 at 16:00
  • @MadPhysicist I updated the question to indicate searching for the smallest subset of the document that contains all of the searchTerms. It read correctly in my mind. Again, my bad.. I timed each of the major components within the function and I'm seeing a big performance hit when trying to find the shortest subset toward the end. The index list comprehension and the building of my iterator works smoothly but then when I actually have to crunch through the results, I hit a wall. Apparently using a list comprehension inside of the min() function isn't helping when facing a large iterator. – Ron Wellman Mar 07 '16 at 16:25
  • Your iterator is enormous, but that is not something to be proud of. I am drafting a response for something much faster and without the fancy iteration. – Mad Physicist Mar 07 '16 at 16:51
  • May I use Python3 in the response or is this Py2 only? – Mad Physicist Mar 07 '16 at 16:54
  • @MadPhysicist The system that my code must run against is 2.7. – Ron Wellman Mar 07 '16 at 17:42
  • @RonWellman. Thanks. I thought that `min(..., key=...)` is a Python 3-only syntax. Turns out, it works for 2.7 as well. – Mad Physicist Mar 07 '16 at 17:57
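
Here is a minimal sketch of the fix Roland W's comment describes. The index lists are stand-ins taken from the song example rather than a real document; the point is that p.map calls the worker once per candidate tuple, so min_max should score a single tuple and the outer min should run over the mapped results:

from itertools import product
from multiprocessing import Pool

def min_max(x):
    # score a single candidate tuple of indices: (spread, tuple)
    return (max(x) - min(x), x)

if __name__ == '__main__':
    # stand-in index lists (from the song example in the question)
    index = [[0, 33], [8], [9, 11]]
    combinations = product(*index)

    p = Pool(processes=2)
    # p.map applies min_max to every tuple; min then picks the best score
    shortest = min(p.map(min_max, combinations), key=lambda x: x[0])
    print(shortest)   # (9, (0, 8, 9))

Note that this only parallelises the scoring; it does not reduce the number of tuples product generates, which is the real bottleneck.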

2 Answers


I've been thinking about this problem for a day now and I find it pretty interesting. It helps to think of the "document" as a line and each "word" as a point on the line. Then, any solution is a window/range covering a part of this line with a left side (start) and right side (end).

Attempted Solution: The reason Mad Physicist's solution does not work is that it starts with a point on this line and treats the distances from this point to every other point as independent choices, when they actually overlap heavily. It chooses only the closest occurrence of each matching search word, which limits the solution space that is searched, so some solutions are missed. It is not hard to find an example, such as:

document = 'a x x d x x x a j x x'
searchTerms = 'a d j'.split()

It starts with d and then selects the closest a, when the farther a would yield the shorter overall solution.
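
As a quick check on this example (assuming answer_product is the brute-force function from the question and answer_window is the implementation at the end of this answer):

# the correct shortest segment is 'd x x x a j'
print(answer_product(document, searchTerms))   # 'd x x x a j'
print(answer_window(document, searchTerms))    # 'd x x x a j'
# the nearest-neighbour approach returns the longer 'a x x d x x x a j'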

Brute Force Solution: Your solution in the question uses product to generate possible solutions and checks each one. This is fine, and actually pretty fast for small problems like the example you posted, but as the length of the document grows, and especially as the number of search terms grows, the number of combinations from product grows very rapidly.

New Solution: The first thing we can do is realize that any combination that doesn't include all of the points between its minimum and maximum indices is invalid. This eliminates a lot of combinations and means you are effectively only choosing combinations of (start, end) points, regardless of the number of search words.

While there's probably some fancy mathematical formula for generating these combinations, I took a different approach...

If you consider the indices of each search word individually as mini-windows from their lowest index to highest, it is clear that the lower bound on the end index of the solution is the maximum start index of all of these ranges. This is because any window with a lower end index will not include that search word. The lower bound on the start index is simply the lowest match index.

This (start,end) must be a solution so we use it as an initial guess, then follow this procedure:

It helps to create a flat list of all matching indices and work on this list, since all other indices are irrelevant. In this case start = 0.

Advance the start index to the next match index (start++ on the flat list). This pops the leftmost match out of the window. Take the next index for that match that is not less than start. If this index is already within the range, then we have dropped a redundant match and have another solution. If this index is outside the range on the right, move end to expand the range to include this match again. If there are no more occurrences of this match available, then we have run out of solutions.

Repeat this process until there are no more solutions, keeping track of which solution produces the shortest range = end - start. This is the final solution.
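
As a quick walk-through on the song example from the question: the match indices are this -> [0, 33], goes -> [8] and on -> [9, 11], so the flat list of match positions is [0, 8, 9, 11, 33]. The largest first-occurrence index is on at word 9, so the initial window runs from word 0 to word 9 ('this is a song that never ends it goes on'). Popping the leftmost match (this at word 0) forces the window out to the next this at word 33, which is much longer, and popping goes at word 8 finds no replacement at all, so the search stops and the initial window is the answer.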

Testing:

To get a little more variety in the tests and verify my solution was producing the same solutions as the original, I randomly grab k search terms out of document:

import random
terms = random.sample(document.split(), random.randint(3,5))
print(terms)
s1 = answer_product(document, terms)
s2 = answer_window(document, terms)

assert s1 == s2

And then to try and do a simple benchmark I used:

import timeit
N = 1000
for j in xrange(2,8):
    terms = random.sample(document.split(), j)
    print(N,j)
    print('window: %s s'%timeit.timeit(lambda: answer_window(document*4, terms), number=N))
    print('product: %s s'%timeit.timeit(lambda: answer_product(document*4, terms), number=N))

On my computer, for the small case of N=1000,k=2, they are both very quick around t~=0.03s. As k grows to k=7, though, the time it takes for answer_product grows to t>20s while answer_window is still t~=0.03s. Note, since I don't have an actual 'document' for testing, I just multiplied the example one by 4 to increase the amount of searching.

╔═════╦═══╦═══════════════════╦════════════════════╦═══════╗
║  N  ║ k ║ answer_window (s) ║ answer_product (s) ║  p/w  ║
╠═════╬═══╬═══════════════════╬════════════════════╬═══════╣
║ 1e3 ║ 2 ║ 0.0231            ║ 0.0347             ║ 1.5   ║
║ 1e3 ║ 3 ║ 0.0227            ║ 0.058              ║ 2.55  ║
║ 1e3 ║ 4 ║ 0.025             ║ 0.242              ║ 9.68  ║
║ 1e3 ║ 5 ║ 0.0326            ║ 3.044              ║ 93.4  ║
║ 1e3 ║ 6 ║ 0.035             ║ 11.55              ║ 330   ║
║ 1e3 ║ 7 ║ 0.0299            ║ 23.82              ║ 797   ║
║ 1e5 ║ 2 ║ 2.2               ║ 2.524              ║ 1.15  ║
║ 1e5 ║ 3 ║ 2.195             ║ 2.743              ║ 1.25  ║
║ 1e5 ║ 4 ║ 3.272             ║ 46.51              ║ 14.2  ║
║ 1e5 ║ 5 ║ 3.74              ║ 67.71              ║ 18.1  ║
║ 1e5 ║ 6 ║ 3.52              ║ 1137               ║ 323   ║
║ 1e5 ║ 7 ║ 3.98              ║ 4519               ║ 1135  ║
╚═════╩═══╩═══════════════════╩════════════════════╩═══════╝

Code:

def answer_window(doc, terms):
    doc = doc.split()
    # create a grouping of indices by match and a flat array of all match
    # indices
    index = {w:[] for w in terms}
    indices = []
    j = 0
    for (i, w) in enumerate(doc):
        if w in index:
            # save real doc indices in flat array and use indices into that
            # array to simplify stepping.  both are automatically ordered
            indices.append(i)
            index[w].append(j)
            j += 1
    # find the maximum leftmost match index.  this is the lower bound on the
    # right side of the solution window
    highest_min = max(v[0] for v in index.values())

    # start with lowest minimum index (first) and highest minimum index (which
    # is the lower bound on the right side). this must be a solution.
    # then look for a shorter one by stepping the left side, replacing lost
    # matches from the right (expanding when necessary) until the left cannot
    # be advanced anymore.  this will cover all possible solution windows and the
    # one with the shortest length is saved
    start, end = 0, highest_min
    sol = start, end
    dsol = indices[sol[1]]-indices[sol[0]]
    while True:
        # pop leftmost match
        pop = doc[indices[start]]
        start += 1
        # need to make sure we still have the match we popped in the range
        for j in index[pop]:
            if j >= start:
                # another copy to the right!
                if j > end:
                    # must expand end to include the replacement
                    end = j
                    if indices[end]-indices[start] < dsol:
                        # new window is shorter than sol
                        sol = start, end
                        dsol = indices[sol[1]]-indices[sol[0]]
                elif indices[end]-indices[start] < dsol:
                    # the replacement is already inside the range, and moving
                    # the left side made the window smaller than sol
                    sol = start,end
                    dsol = indices[sol[1]]-indices[sol[0]]
                break # done with this pop
            else:
                # this match is left of our window
                pass
        else:
            # found no replacement, can't shrink left side anymore so we are
            # out of solutions
            break
    return (' '.join(doc[indices[sol[0]]:indices[sol[1]]+1]))
bj0

The main slowdown in your code comes from the fact that you look for all combinations of indices between the different words. Clearly, most of those combinations will not be even remotely eligible for the shortest run. Here is one algorithm that should run a bit faster:

  1. Find indices of all the search terms, separated by search term (as you are doing)
  2. Use the term with the fewest matches as your base.
  3. For every occurrence of the base term, find the nearest of every other term (note that this is much faster than computing the distance for every combination). The maximum spread of the nearest neighbors is the length of a candidate run.
  4. Find the shortest candidate run and return it.

This version uses dictionary comprehensions and the key parameter to min. It does not, however, use anything outside the __builtins__ module. The code below runs under Python 2.7 and 3.5:

def answer(document, searchTerms):
    #build a list of the input document
    document = document.split()

    # construct list of indices of occurrences for each term
    indices = {w: [i for i,x in enumerate(document) if x == w] for w in searchTerms}

    # find the least frequent term and isolate it
    leastFrequent = min(indices.keys(), key=lambda x: len(indices[x]))
    loopIndex = indices[leastFrequent]
    del indices[leastFrequent]

    # for each element of leastFrequent, compute the nearest distance to each other item
    candidates = [None] * len(loopIndex)
    for index, element in enumerate(loopIndex):
        neighbors = [None] * len(indices)

        # find the distance to the nearest neighbor in each other list
        for ind, term in enumerate(indices):
            neighbors[ind] = min(indices[term], key=lambda x, e=element: abs(x - e))

        # the run length is the maximum of the maximum and element minus the minimum of the minimum and element
        start = min(min(neighbors), element)
        end = max(max(neighbors), element) + 1
        length = end - start
        candidates[index] = length, start, end

    # get the shortest candidate segment
    winner = min(candidates, key=lambda x: x[0])
    return ' '.join(document[winner[1]:winner[2]])
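
Run against the example from the question, this returns the expected segment:

document = 'this is a song that never ends it goes on and on my friend some people started singing it not knowing what it was and they will continue singing it forever just because this is the song'
searchTerms = ['this', 'goes', 'on']
print(answer(document, searchTerms))
# 'this is a song that never ends it goes on'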

If there are s search terms that occur on (geometric) average k times each, this algorithm will run in approximately O(k * s * k) = O(s * k^2) time. The two factors of k come from the loop over element and the call to min inside it; the factor of s comes from the loop over term. By taking the least frequent term as the base, we reduce one of the k factors significantly. Especially in the case where one of the terms appears only once, it is guaranteed to be in every possible combination, so the outer loop only runs once.

For comparison, your implementation uses itertools.product, which produces s nested loops, each of which runs for k iterations. That makes for approximately O(k^s) runtime.
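
As a rough illustration with made-up numbers: for s = 5 search terms that each occur k = 20 times, the nearest-neighbour approach does on the order of 20 * 5 * 20 = 2,000 distance computations, while product() has to generate and score roughly 20^5 = 3.2 million index tuples.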

Mad Physicist
  • Wow. I love the thought process here. I'm still digesting each of the elements, but a side-by-side comparison shows that your code is way faster than what I wrote! I couldn't wrap my head around how to avoid having to test each and every combination, which becomes a huge problem as I scale up the document and searchTerms. I really appreciate all the work. It's definitely helping my education! – Ron Wellman Mar 07 '16 at 20:21
  • @RonWellman. I was glad for the brain teaser and you seemed to have put so much work into it already, I was glad to be able to help out. – Mad Physicist Mar 08 '16 at 04:05
  • Also, I just fixed a major typo: not->note in step #3 – Mad Physicist Mar 08 '16 at 04:07
  • I truly appreciate it. I'm modifying your code slightly for a couple of edge cases that I believe are causing me issues. Obviously I can't tell what the code is being tested against so I have to use my imagination. Modifying for (1) only one searchTerm given (causes crash) (2) same searchTerm given more than once which doesn't play nice with your dictionary (3) etc. Again, I truly appreciate the insight. Your code has provided a unique perspective on attacking problems of this nature. Mine was akin to brute-force by finding every possible combination! – Ron Wellman Mar 08 '16 at 16:54
  • While clever, this answer doesn't actually work for all cases. Try ```document='a x x d x x x a j x x j j'; searchTerms=['a', 'd', 'j']``` for a counter example. Ron's code gives the correct answer of ```d x x x a j``` but this answer yields ```a x x d x x x a j```. – bj0 Mar 08 '16 at 19:16
  • @bj0. You are absolutely right; I had not considered the possibility of a more distant neighbor being included in a run created by a different term. I do not have time to fix it right now, but I will later. In the meantime, I will make this community wiki. Awesome user name BTW. – Mad Physicist Mar 08 '16 at 19:26