6

I have two sorted lists of integers. I would like to find all pairs of integers from the first and second list, respectively, that are within a certain distance of each other.

The naive approach is to check each pair, resulting in O(N^2) time. I am sure there is a way to do it in O(N*logN) or maybe faster.

In python, the naive O(N^2) approach is as follows:

def find_items_within(list1, list2, within):
    for l1 in list1:
        for l2 in list2:
            if abs(l1 - l2) <= within:
                yield (l1, l2)

Extra points for pythonic answers.

Application Note

I just wanted to point out the purpose of this little puzzle. I am searching a document and want to find all the occurrences of one term within a certain distance of another term. First you find the term vectors of both terms, then you can use the algorithms described below to figure out if they are within a given distance of each other.

speedplane
  • 15,673
  • 16
  • 86
  • 138

7 Answers

7

There is no way to do it better than O(n^2) in general, because there are O(n^2) pairs, and for within = infinity you need to yield all of them.


Finding the number of these pairs is a different story, and can be done by finding, for each element e, the first index idx that satisfies e - within <= arr[idx]. The index idx can be found efficiently using binary search, for example - which gets you an O(nlogn) solution for counting the pairs.
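The binary-search version can be sketched with Python's bisect module (the function name is mine; list2 must be sorted):

```python
from bisect import bisect_left, bisect_right

def count_pairs_within(list1, list2, within):
    """Count pairs (a, b), a from list1 and b from sorted list2, with |a - b| <= within."""
    count = 0
    for a in list1:
        lo = bisect_left(list2, a - within)   # first index with list2[lo] >= a - within
        hi = bisect_right(list2, a + within)  # first index with list2[hi] > a + within
        count += hi - lo                      # list2[lo:hi] are all within distance of a
    return count
```

Each loop iteration does two O(log n) searches, giving O(n log n) overall.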

It can also be done in linear time (O(n)), since you don't really need a binary search for every element: after the first range [a,b] is found, note that for every later range [a',b'], if a' >= a then b' >= b - so you can iterate the lists with two pointers that "never look back", which gives linear time complexity.

pseudo code: (for linear time solution)

numPairs <- 0
i <- 0
a <- 0
b <- 0
while (i < list1.length):
  while (a < list2.length && list1[i] - list2[a] > within):
      a <- a+1
  while (b < list2.length && list2[b] - list1[i] <= within):
      b <- b+1
  if (b > a):
      numPairs <- numPairs + (b-a)
  i <- i+1
return numPairs

(I made some fixes to the initial pseudo code - the first version was counting pairs within range in a single list rather than matches between two lists, sorry for that)
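A runnable Python sketch of the pseudo code (my naming; both lists are assumed sorted, as in the question):

```python
def count_pairs_within(list1, list2, within):
    """Two-pointer count of pairs (x, y), x from list1 and y from list2,
    with |x - y| <= within. Both lists must be sorted."""
    num_pairs = a = b = 0
    for x in list1:
        while a < len(list2) and x - list2[a] > within:
            a += 1               # skip list2 elements below x - within
        while b < len(list2) and list2[b] - x <= within:
            b += 1               # extend past list2 elements up to x + within
        if b > a:
            num_pairs += b - a   # list2[a:b] is the window for x
    return num_pairs
```

Both pointers only ever move forward, so the whole scan is O(n).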

amit
  • 175,853
  • 27
  • 231
  • 333
  • (and for binary search - the OP should look at the `bisect` module) – Jon Clements Nov 29 '12 at 14:20
  • I think number of pairs can be done in O(n) - just find the window for the first number, and slide it for subsequent numbers. – nhahtdh Nov 29 '12 at 14:21
  • 1
    amit -- You don't need to test them all. since the data is sorted, once you get outside the distance you care about, you can safely break that loop. – mgilson Nov 29 '12 at 14:22
  • Suppose the number of pairs is not too big compared to the size of the list. Would it be possible to do it in `O(size_of_list x num_of_pairs)`? – speedplane Nov 29 '12 at 14:22
  • @nhahtdh -- That approach assumes that the numbers are equally spaced. (which isn't a given) here. – mgilson Nov 29 '12 at 14:22
  • All: I was thinking how to put this approach as clearly as possible while you were commenting - see edit. What do you say? – amit Nov 29 '12 at 14:23
  • 1
    @mgilson: No need for the numbers to be equally spaced. The window can grow and shrink. – nhahtdh Nov 29 '12 at 14:24
  • @speedplane: Using one of the suggested approaches in the answer - you can produce all the pairs. The first one lets you do it in `O(nlog(n) + num_pairs)` and the second in `O(n + num_pairs)` (just iterate all the elements in each range found) – amit Nov 29 '12 at 14:26
  • @nhahtdh -- The window can grow and shrink, but then you lose your linear complexity (in principle) because you need to search for your window again. – mgilson Nov 29 '12 at 14:28
  • 1
    @mgilson: Yes, but the 2 pointers will slide exactly n times. We just start from the previous position. – nhahtdh Nov 29 '12 at 14:39
  • @mgilson I don't think you lose linear time complexity: the most the window can slide is the length of the array. – vlad Nov 29 '12 at 14:39
6

This code is O(n*log(n)+m) where m is the size of the answer.

def find_items_within(l1, l2, dist):
    l1.sort()
    l2.sort()
    b = 0
    e = 0
    ans = []
    for a in l1:
        while b < len(l2) and a - l2[b] > dist:
            b += 1
        while e < len(l2) and l2[e] - a <= dist:
            e += 1
        ans.extend([(a,x) for x in l2[b:e]])
    return ans

In the worst case, it is possible that m = n*n, but if the answer is just a small subset of all possible pairs, this is a lot faster.

Thomash
  • 6,339
  • 1
  • 30
  • 50
  • Lots of good suggestions here, but I think this one is the easiest to understand, does not rely on anything too fancy and has algorithmic time as good or better than the others. Although @J.F. Sebastian's answer claims to be faster, I think it's the same when you factor in the O(log(n)) lookups it does. – speedplane Nov 30 '12 at 02:47
  • @speedplane: `set()` in Python has `O(1)` amortized lookups as the comments in my code explicitly say (think `unordered_set<>` in C++, not `set<>` with `O(log(n))`). btw, the time complexity is not better it is the same if `.sort()` is removed above (my answer assumes that the input is sorted as said in the first sentence of your question and `assert issorted()` statements in the code hint about it too). This answer is 2-3 times faster for the input I've tried on my machine. – jfs Nov 30 '12 at 19:07
  • @speedplane: btw, Thomash's answer can be made linear time if `ans.extend` is replaced with `yield i, (b,e)` where `l1[i] == a`. You will still need `O(n*n)` to enumerate all pairs explicitly, but you can find pairs (as in: know their indexes range) in `O(n)` for sorted input. – jfs Nov 30 '12 at 19:12
  • @J.F.Sebastian: It is not possible to have a linear time algorithm because the output is not linear. The best you can do is O(m) where m is the size of the output and that is the complexity of my algorithm if you consider the input to be sorted. – Thomash Nov 30 '12 at 20:20
  • @Thomash: have you read "You will still need `O(n*n)` (`O(m)` using your terminology) to enumerate all pairs explicitly" part of my comment? A single `yield i, (b, e)` gives you `e-b` pairs at once. – jfs Nov 30 '12 at 20:30
2

Here is something with the same interface as you have given:

from collections import deque

def find_items_within(list1, list2, within):
    i2_idx = 0
    shared = deque()  # deque so dropping from the front is O(1)
    for i1 in list1:
        # drop values that are now too small
        while shared and abs(shared[0] - i1) > within:
            shared.popleft()
        # skip list2 values that are still too small to ever match
        while i2_idx < len(list2) and list2[i2_idx] < i1 - within:
            i2_idx += 1
        # pull in new values that are close enough
        while i2_idx < len(list2) and list2[i2_idx] <= i1 + within:
            shared.append(list2[i2_idx])
            i2_idx += 1
        # yield the current pairs
        for s in shared:
            yield (i1, s)

for item in find_items_within([1, 2, 3, 4, 5, 6], [3, 4, 5, 6, 7], 2):
    print(item)

Not very beautiful, but it should do the trick in O(N*M), where N is the length of list1 and M is the average number of shared values per item (given that the number of elements dropped from and appended to shared is constant on average).

Peter Schneider
  • 1,683
  • 12
  • 31
0

This seems to work:

from itertools import takewhile

def myslice(lst, start, stop, stride=1):
    stop = len(lst) if stop is None else stop
    for i in range(start, stop, stride):
        yield lst[i]

def find_items_within(lst1, lst2, within):
    l2_start = 0
    for l1 in lst1:
        try:
            l2_start, l2 = next((i, x)
                                for i, x in enumerate(myslice(lst2, l2_start, None), l2_start)
                                if abs(l1 - x) <= within)
            yield l1, l2
            for l2 in takewhile(lambda x: abs(l1 - x) <= within,
                                myslice(lst2, l2_start + 1, None)):
                yield l1, l2
        except StopIteration:
            pass


x = range(10)
y = range(10)
print(list(find_items_within(x, y, 2.5)))
mgilson
  • 300,191
  • 65
  • 633
  • 696
0

Depending on the distribution of the values in your lists, you might be able to speed things up by using binning: take the range in which all your values fall (min(A+B), max(A+B)), and divide that range into bins of the same size as the distance D you’re considering. Then, to find all pairs, you only need to compare values within a bin or within adjacent bins. If your values are split up between many bins, this is an easy way to avoid doing M*N comparisons.
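A minimal sketch of the binning idea, assuming integer values and a positive distance (the names and the bucket scheme are my own):

```python
from collections import defaultdict

def find_pairs_binned(list1, list2, dist):
    """Bucket list2 by floor(value / dist); any value within `dist` of a
    list1 value must then sit in the same bin or one of the two adjacent bins."""
    assert dist > 0
    bins = defaultdict(list)
    for b in list2:
        bins[b // dist].append(b)
    for a in list1:
        k = a // dist
        for bucket in (k - 1, k, k + 1):
            for b in bins.get(bucket, []):
                if abs(a - b) <= dist:   # still need the exact check within a bin
                    yield (a, b)
```

When the values spread across many bins this avoids most of the M*N comparisons; with everything in one bin it degrades back to the naive approach.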

Another technique that might be just as easy in practice: Do a sort of bounded linear scan. Maintain an index into list A and into list B, starting from the beginning. On each iteration, advance the index into list A (start with the first element), call this element A0. Then, advance the index into list B. Remember the last value of B that’s less than A0-D (this is where we’ll want to start for the next iteration). But keep moving forward while you’re finding values between A0-D and A0+D — these are the pairs you’re looking for. As soon as the values in B become greater than A0+D, stop this iteration and start the next one — advance one element further into A, and start scanning B from the last place where B was < A0-D.

If you have, on average, a constant number of nearby pairs per element, I think this should be O(M+N)?
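The bounded scan might look like this in Python (a sketch with my own names, assuming both lists are sorted):

```python
def find_pairs_scan(A, B, D):
    """Two-index scan: `start` remembers where B last dropped below a - D,
    so B is never rescanned from the beginning."""
    start = 0
    for a in A:
        while start < len(B) and B[start] < a - D:
            start += 1           # B[start] is the first candidate >= a - D
        j = start
        while j < len(B) and B[j] <= a + D:
            yield (a, B[j])      # every B[j] in [a - D, a + D] pairs with a
            j += 1
```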

illya
  • 765
  • 5
  • 8
0

This method uses a dictionary whose keys are possible values of list2, and whose values are a list of values of list1 that are within distance of that value of list2.

def find_items_within(list1, list2, within):
    a = {}
    # map every value within distance of some l1 to the list of such l1s
    for l1 in list1:
        for i in range(l1 - within, l1 + within + 1):
            a.setdefault(i, []).append(l1)
    # each l2 then looks up its matches directly
    for l2 in list2:
        if l2 in a:
            for l1 in a[l2]:
                yield (l1, l2)

The complexity for this one is kind of goofy: for a list1 of size M, a list2 of size N, and a within of size W, building the dictionary is O(M*W) and querying it is O(N) plus the output size, so roughly O(M*W + N) on average (dict operations are amortized O(1)). In practice I think it works pretty well for small W.

Bonus: this works on unsorted lists too.

Kevin
  • 74,910
  • 12
  • 133
  • 166
0

You can find integers from list2 that are in the [x - within, x + within] interval for all x from list1 in linear time (O(n)) using the "scan line" technique (see How to Find All Overlapping Intervals and Sub O(n^2) algorithm for counting nested intervals?).

To enumerate the corresponding intervals from list1 you need O(m) time, where m is the number of overlapping intervals, i.e., the overall algorithm is O(n*m):

from collections import namedtuple
from heapq import merge

def find_items_within(list1, list2, within):
    issorted = lambda L: all(x <= y for x, y in zip(L, L[1:]))
    assert issorted(list1) and issorted(list2) and within >= 0

    # get sorted endpoints - O(n) (since list1, list2 are sorted)
    # NOTE: field order matters - on equal endpoints START sorts before
    # POINT and POINT before END, so pairs at exactly `within` are kept
    Event = namedtuple('Event', "endpoint type x")
    def get_events(lst, delta, type):
        return (Event(x + delta, type, x) for x in lst)
    START, POINT, END = 0, 1, 2
    events = merge(get_events(list1, delta=-within, type=START),
                   get_events(list1, delta=within, type=END),
                   get_events(list2, delta=0, type=POINT))

    # O(n * m), m - number of points in `list1` that are 
    #               within distance from given point in `list2`
    started = set()  # currently open intervals
    for e in events:  # O(n)
        if e.type == START:  # an interval from list1 opens
            started.add(e.x)  # O(1) amortized
        elif e.type == END:  # an interval from list1 closes
            started.remove(e.x)  # O(1) amortized
        else:  # found a point from list2
            assert e.type == POINT
            for x in started:  # O(m)
                yield x, e.x

To allow duplicate values in list1, you could add an index to each Event and use a dictionary index -> x instead of the started set.
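A sketch of that modification (the field names and the index key are my own): intervals are keyed by their position in list1, so equal values no longer collide.

```python
from collections import namedtuple
from heapq import merge

def find_items_within_dups(list1, list2, within):
    """Scan-line variant that tolerates duplicate values in sorted list1."""
    Event = namedtuple('Event', "endpoint type key x")
    START, POINT, END = 0, 1, 2  # tie-break order on equal endpoints
    events = merge(
        (Event(x - within, START, i, x) for i, x in enumerate(list1)),
        (Event(x + within, END, i, x) for i, x in enumerate(list1)),
        (Event(x, POINT, -1, x) for x in list2))
    started = {}  # index in list1 -> value (duplicates get distinct keys)
    for e in events:
        if e.type == START:
            started[e.key] = e.x
        elif e.type == END:
            del started[e.key]
        else:  # a point from list2: pair it with every open interval
            for x in started.values():
                yield x, e.x
```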

jfs
  • 399,953
  • 195
  • 994
  • 1,670