Longest common subsequence of two datasets with tolerance

Question

I have four 1D np.arrays: x1, y1, x2, y2, where x1 and y2 has the same length, also x2 and y2 has the same length, since they are corresponding x and y values for a dataset. len(x1) and len(x2) are always different. Let's assume len(x1) > len(x2) for now. These two arrays always have common values, but in a special way: the values are not exactly the same, only within a tolerance (because of numerical errors, etc.). Example with tolerance = 0.01:

x1 = np.array([0, 1.01, 1.09, 1.53, -9.001, 1.2, -52, 1.011])
x2 = np.array([1, 1.1, 1.2, 1.5, -9, 82])

I want to keep only the common values (in the tolerance manner). Use the shorter array for reference, which is x2 in this case. The first value in x2 is 1, and has a corresponding value in x1, which is 1.01. Next: 1.2 has also a corresponding value in x2, 1.2. The value 1.5 has no corresponding value, because 1.53 is out of tolerance, so filter it out, etc.. The full result should be:

x1 = np.array([1.01, 1.09, -9.001, 1.2])
x2 = np.array([1, 1.1, -9, 1.2])

To bring this one step further, based on filtering the x values this way I want to filter the y values for the same indices for both datasets, so in other words I want to find the longest common subsequence of two datasets. Note that ordering is important here because of the connection with the y values (it doesn't matter if we argsort x, and reindex x and y with that first).

What I have tried based on this answer:

def longest_common_subseq(x1, x2, y1, y2, tol=0.02):
    # sort them first to keep x and y connected
    idx1 = np.argsort(x1)
    x1, y1 = x1[idx1], y1[idx1]
    idx2 = np.argsort(x2)
    x2, y2 = x2[idx2], y2[idx2]
    
    # here I assumed that len(x2) < len(x1)
    idx = (np.abs(x1[:,None] - x2) <= tol).any(axis=1)
    
    return x1[idx], x2[idx], y1[idx], y2[idx]

the y values can be arbitrary in this case, only the shapes must match with x1 and x2. For example:

y1 = np.array([0, 1, 2, 3, 4, 5, 6, 7])
y2 = np.array([-1, 0, 3, 7, 11, -2])

Trying to run the function above raises

IndexError: boolean index did not match indexed array along dimension 0.

I understand: The index array's length is wrong because x1 and x2 have different length, and so far I couldn't do it. Is there a nice way to achieve this?

EDIT:

If multiple values are inside the tolerance, the closest should be selected.

The task is not defined well enough. What happens if `x1=[1, 1.02, 1.03]` and `x2=[1.021, 1.017]`? Multiple answer are within `tol=0.02`, and the same answer in `x1` is closest to different elements in `x2`. What would be a correct pairing here? — Aguy, Aug 02 '20 at 15:34
@Aguy That's almost surely not going to happen, because `x1` and `x2` were originally the same arrays, one of them just goes through slicing and certain operations. The case you mentioned the correct answer is `x1=[1.02]`, `x2=[1.021]`. We use the shorter array for reference, going through its elements, see if there's any corresponding value in `x2` for the desired `tol`, and select the closest. Once we selected a value, we need to exclude it afterwards to avoid duplication. I know this sound complicated, I agree that it's not defined well enough, maybe I'll rephrase the question later. — Péter Leéh, Aug 02 '20 at 16:25

score 1 · Answer 1 · answered Aug 03 '20 at 05:04

I think this should do the trick:

def longest_common_subseq(x1, x2, y1, y2, tol=0.02):
    # sort them first to keep x and y connected
    idx1 = np.argsort(x1)
    x1, y1 = x1[idx1], y1[idx1]
    idx2 = np.argsort(x2)
    x2, y2 = x2[idx2], y2[idx2]
    
    # here I assumed that len(x2) < len(x1)
    difference = np.abs(x1[:,None] - x2) <= tol
    no_multiples = difference.cumsum(axis=0).cumsum(axis=0) == 1
    out_idx1 = no_multiples.any(axis=1)
    out_idx2 = no_multiples.any(axis=0)
    return x1[out_idx1], x2[out_idx2], y1[out_idx1], y2[out_idx2]

Breaking that down, this block of code

difference = np.abs(x1[:,None] - x2) <= tol
no_multiples = difference.cumsum(axis=0).cumsum(axis=0) == 1
out_idx1 = no_multiples.any(axis=1)

does the same thing as the function you have above, but I used the cumsum trick from this post to get rid of multiple values inside the tolerance.

Then you need a second set of indices from the other axis to avoid that IndexError. That is what this line does

out_idx2 = no_multiples.any(axis=0)

However your solution returns the correct result for `tol=0.02`, try with `tol=0.2`: the returned arrays have different lengths. Now I'm trying to understand why it's happening. — Péter Leéh, Aug 03 '20 at 10:28

Mad Physicist · Accepted Answer · 2020-08-03T12:48:21.190

A simple way would be to find the distances between all the elements:

dist = np.abs(x1 - x2[:, None])

Since you say that normally you won't have multiple elements within tolerance of any other element, you can do

i2, i1 = np.nonzero(dist < tol)

If you have multiple matches, you can prune the matches first:

i1 = np.argmin(dist, axis=1)
i2 = np.flatnonzero(dist[np.arange(x2.size), i1] < tol)
i1 = i1[i2]

If the original data was sorted, the indices will be too (they will be diagonal-ish). That means that you can check for subsequence length by examining the spacing between indices. A matching sequence will have both indices incrementing by one.

mask = (np.diff(i1) == 1) & (np.diff(i2) == 1)
# smear the mask to include both endpoints
mask = np.r_[False, mask] | np.r_[mask, False]
# pad the mask to ensure proper indexing and find the changeover points
locs = np.diff(np.r_[False, mask, False])
inds = np.flatnonzero(locs)
lengths = inds[1::2] - inds[::2]

You can find the indices of the longest run from the quantities above:

k = np.argmax(lengths)
start = inds[2 * k]
stop = inds[2 * k + 1]
longest_x1 = x1[i1[start:stop]]
longest_y1 = y1[i1[start:stop]]
longest_x2 = x2[i2[start:stop]]
longest_y2 = y2[i2[start:stop]]

It (almost) works. The line `i2 = np.flatnonzero(dist[i1, np.arange(x2.size)] < tol)` raises `IndexError`, however if I leave that part out it returns the correct results. Any idea why is that happening? — Péter Leéh, Aug 03 '20 at 10:32
@PéterLeéh. I flipped the indices on that one. Fixed now. You only need those three lines in place of the first snippet if you need to select between multiple adjacent elements. — Mad Physicist, Aug 03 '20 at 12:46

Longest common subsequence of two datasets with tolerance

2 Answers2