2

I want to compare two lists containing matching integers of different length. The goal is to make them the same length by removing items from the longer list based on missing values from the shorter list. The lists:

list1 = [101, 201, 301, 402, 502, 603, 701, 802, 904, 10012, 10021, 10033, 10041, 10054, 10062, 10071, 10082, 10093, 10101]
list2 = [102, 203, 504, 601, 703, 901, 10013, 10071, 10082, 10093, 10103]

However the matching values are not exactly the same for both list and can vary between 0 and 3 in this example.

The result would look like this:

resultlist1 = [101, 201, 502, 603, 701, 904, 10012, 10073, 10082, 10093, 10101]
resultlist2 = [102, 203, 504, 601, 703, 901, 10013, 10071, 10082, 10093, 10103]

removed_items_list1 = [2, 3, 7, 10, 11, 12, 13, 14]  # Index numbers of 

I tried the following without success

set(list1).intersection(list2)

Only returns exact matches

for i in xrange(len(list2)):
    if abs(list1[i] - list2[i]) > 3:
        del list1[i]

Does not remove all unwanted values

How would I compare these two lists with unequal length and remove non-matches (within a certain variation) in the longer list?

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
cf2
  • 581
  • 1
  • 7
  • 17

2 Answers2

1

Here is a solution that takes linear time; the others take quadratic time, though that may be just fine if your input is small.

def align(shorter, longer, margin=3):    
    result = []
    removed = []

    longer = enumerate(longer)

    for target in shorter:
        while True:
            index, current = next(longer)
            if abs(current - target) <= margin:
                result.append(current)
                break
            else:
                removed.append(index)

    return result, removed

This assumes that you can always align the lists as in your example. If this isn't true you'll need to add some error checking to the above.

Example:

>>> align(list2, list1)
([101, 201, 502, 603, 701, 904, 10012, 10071, 10082, 10093, 10101],
 [2, 3, 7, 10, 11, 12, 13, 14])
jme
  • 19,895
  • 6
  • 41
  • 39
  • linear can still take longer than quadratic for certain length. So it would be interesting to compare the performance of the different methods, if someone has the time :) – dnalow Oct 25 '16 at 14:57
  • Very true, thus the "if the input is small" disclaimer. But ultimately a list comprehension is only marginally faster than the equivalent `for`-loop in Python, so the hidden constants for the above solution will be comparable to those for the comprehension solution, at least. I don't have any tests, but my guess is that you'll start to see a real difference when the lists are a few hundred entries long. – jme Oct 25 '16 at 15:01
  • I like your solution. I use a list of around 1200 entries. However when I use your function I get a StopIteration error: index, current = next(longer) – cf2 Oct 25 '16 at 15:07
  • @cf2 This means that your lists aren't aligned like your example. In other words, we can't make them the same length by just taking the entries from the larger until they are within the margin of the lower. For example, this input would give an error: `longer = [1, 2, 9, 9, 9, 9]` and `shorter = [1, 2, 3]`, with a margin of `3`. – jme Oct 25 '16 at 15:10
0

You can use numpy array comparison:

list1 = [101, 201, 301, 402, 502, 603, 701, 802, 904, 10012, 10021, 10033, 10041, 10054, 10062, 10071, 10082, 10093, 10101]
list2 = [102, 203, 504, 601, 703, 901, 10013, 10071, 10082, 10093, 10103]
import numpy as np
l1 = np.array(list1)
l2 = np.array(list2)

ind = abs(l1 - l2[:,None]) <= 3

print l1[ind.max(0)]
print l2[ind.max(1)]
print ind.max(1)
print ind.max(0)
print np.where(~(ind.max(0)))

results in

[  101   201   502   603   701   904 10012 10071 10082 10093 10101]

[  102   203   504   601   703   901 10013 10071 10082 10093 10103]

[ True  True  True  True  True  True  True  True  True  True  True]

[ True  True False False  True  True  True False  True  True False False
 False False False  True  True  True  True]

(array([ 2,  3,  7, 10, 11, 12, 13, 14]),)
dnalow
  • 974
  • 4
  • 14