Algorithm, closest point between list elements

Question

I have n ordered lists of unequal size (where I do not know beforehand how many lists there are going to be). I need to find the minimum average distance between one element in each list.

For example given n=3 for three lists:

a = [14, 22, 36, 48]
b = [14, 23, 30, 72]
c = [1, 18, 24]

The output should be (22,23,24) because:

mean(abs(22-23), abs(23-24), abs(22-24)) = 1.33333

which is the smallest among all the points in the example above.

I tried to implement it in Python as following

def aligner(aoa):
'''
read arrays of arrays of peaks and return closest peaks
'''
#one of arrays is empty
if not [y for x in aoa for y in x]:
    return None
# there is the same nr in all array no need to do anything
candidate = set.intersection(*map(set, aoa))
if candidate:
    # returns intersect
    return [max(list(candidate))] * len(aoa)
else:
    #tried cartesian product via bumpy malloc err
    pass

My doubt is now regarding the implementation of the other part. I thought about using the cartesian product for generating all combination but is going into memory issues. My guesses would be do generate all the combination somehow (maybe itertools??) and loop through all of those but I don't know if there s any algorithm that tackles this problem that I could use.

I do not need code but only hints whether there is any efficient way for solving this or the brute force with n for loops on the permutated lists is the only one

EDIT

Regarding the size of the problem the nr of list is maximum 100 (fixed) while the nr of elements can be vary but I would say that the presented example with 4 or 5 points per list is a realistic scenario.

All points are non-negative.

Tried the proposed itertools solution but of course not memory issues but has been running since hours and it's stuck on the 3rd element.

Seems like more of a graph theory like problem? https://math.stackexchange.com/questions/581831/calculating-the-shortest-distance-between-n-number-of-points-on-a-scatter-graph. If not, perhaps a tree-like structure for each list would be your best bet. — ZaxR, Jul 01 '18 at 14:11
It isn't quite clear what is meant by "minimize the difference between all points". You can minimize one number. — n. m. could be an AI, Jul 01 '18 at 14:23
From the set of points chosen from the sets (I'm assuming one point per set), you can compute pairwise differences. It's the sum of those differences the OP wants to minimize, I believe. — chepner, Jul 01 '18 at 15:32
@n.m. yes as written by chepner I want to minimize the pairwise absolute difference between all points in a set (or alternatively as said also the sum of the difference can be used or even the standard deviation between the proposed points) — D.A., Jul 02 '18 at 13:26
If memory is the only problem, you can just use `itertools.product` to iterate all the possible combinations, without realizing them in a list, first. But then, if memory is an issue, it's likely that time will also be, and just using a generator won't fix that. — tobias_k, Jul 02 '18 at 15:55

score 2 · Answer 1 · answered Jul 02 '18 at 09:52

2

First of all, optimizing the mean of the differences is the same as optimizing the sum of the differences.

If you model you problem as directed graph, this can be solved:

Let your lists be A, B, C. Every entry of your lists is a vertex of the graph v_ai where a is the list and i the index.

For every index i in A, j in B, add an edge v_ai -> v_bj with weigth abs(A(i) - B(j))

For every index i in B, j in C, add an edge v_bi -> v_cj with weigth abs(B(i) - C(j))

For every index i in C, j in A, add an edge v_ci -> v_aj with weigth abs(C(i) - A(j))

What you are looking for now is the minimum cycle in this graph. Use this answer for an O(n^3) algorithm. (a modified Floyd-Warshall algorithm)

answered Jul 02 '18 at 09:52

kutschkem

7,826
3
21
56

Is it really just a cycle? What if there are four lists? – tobias_k Jul 02 '18 at 14:58
@tobias_k Then you do A -> B -> C -> D -> A – kutschkem Jul 02 '18 at 15:12
@tobias_k ... no that isn't right, is it? Since then you are missing A - C, and B - D ... – kutschkem Jul 02 '18 at 15:13
@tobias_k On the other hand, | A - C | <= | A - B | + | B - C | (triangle inequality), so at least it is an approximation. I will see if I can prove that a cycle minimizes the whole thing, but I am not really sure. – kutschkem Jul 02 '18 at 15:21

nerryoob · Accepted Answer · 2018-10-11T19:01:11.997

This method is a brute force method, but uses a method of elimination similar to Dijkstra's algorithm, that results in far fewer cases (making an algorithm that is most likely orders of magnitude faster, especially for big lists or lots of lists). Tell me if you don't understand it, and I can clarify. The implementation can be found here: https://github.com/nerryoob/closestPoint

What you are doing is making a list of the different combination of numbers (i.e. answers)? Best at the start (index 0), worst at the end, or vice versa, see what works best. You'll make the result list for the first input list only, completely ignoring the others. For one list, of course, all the items are the solution - they all have total difference 0. So just copy the first input list into the result list

Next, probably with a while loop, follow this algorithm. Take the top item and pop it from the result list. Store its value. Go to the next input list and for each item in that next input list, make a copy of the top item you just popped that also has the item of the next input list. Find the new overall difference and insert the new item based off of that into the list. Repeat until the top solution has all of the lists in. This means that you guarantee you have the best solution (of at least joint 1st), whilst spending far less time on the combinations that are obviously not the solution

Example (the number in brackets is the total difference)

[14, 22, 36, 48] [14, 23, 30, 72] [1, 18, 24]

Result list is [14(0), 22(0), 36(0), 48(0)]

Look at 14. Insert the new numbers [14 and 14(0), 22(0), 36(0), 48(0), 14 and 23(9), 14 and 30 (16), 14 and 72 (58)]
Look at 14 & 14. Insert the new numbers [22(0), 36(0), 48(0), 14 and 14 and 18(8), 14 and 23 (9),14 and 30 (16), 14 and 14 and 24 (20), 14 and 14 and 1(26), 14 and 72(58)]
Look at 22. Insert the new numbers [36(0), 48(0), 22 and 23(1), 14 and 14 and 18(8), 22 and 14(8), 22 and 30(8), 14 and 23 (9),14 and 30 (16), 14 and 14 and 24 (20), 14 and 14 and 1(26), 22 and 72(50), 14 and 72(58)]

Keep on repeating and you end up with 22, 23, 24 on top. Because this has all n lists in it, you can therefore stop and give back the answer

To optimise it:

Remove duplicates
Perhaps exploit ordered lists in some way
Think about where you put items with the same total difference, perhaps items with more numbers coming first

EDIT: Algorithmic complexity is O(n^2)

can you say something about the complexity of your algorithm? — kutschkem, Jul 02 '18 at 11:10
Is not clear to me how to stop it (coding it right now). By checking the nr of elements in a solution you would stop with 14-14-18 and if you do want to check the top solution you need to have all the combination for finding the best one. — D.A., Jul 03 '18 at 08:04
D.A I think you misinterpreted it. If a solution is at the top and has all the lists in, then by definition it is the best, as the total differences for the others cannot go down as you make them more complete. Carry on you end up with 22 and 23 on top, add in 24 and you end up with 22 and 23 and 24 on top with a total difference of 3- the correct solution — nerryoob, Jul 03 '18 at 11:27
The implementation has been tested and does work. If you want I can create an interactive one that walks through the steps — nerryoob, Jul 03 '18 at 11:48

score 0 · Answer 3 · answered Jul 01 '18 at 20:38

I'm not sure about the best way to find the optimal solution but one heuristic might be to examine ranges. If our lists are sorted, we can check if the list has an element within a range using a binary search. So we can divide and conquer, attempting to narrow ranges that include an element from each list. Because of the nature of mean computation, we might unfortunately also be interested in ranges that include elements from many but not all lists, since a collection of very close numbers with a few outliers might produce a smaller differences-mean than more variation within a smaller range; this complicates the solution quite a bit.

score 0 · Answer 4 · answered Jul 02 '18 at 16:02

We don't really know a lot about the size of your problem, i.e. how many lists, and how many elements per list. For starters, and for setting a baseline, you could just use itertools.product to iterate all the possible combinations of elements from the three lists, without realizing them in a list. You can then iterate those and find the best one, or pass them directly into min and use a special key function using itertools.combinations and sum to find the one with the lowest average distance (if the sum is lowest, so is the average).

>>> a = [14, 22, 36, 48]
>>> b = [14, 23, 30, 72]
>>> c = [1, 18, 24]
>>> len(list(itertools.product(a, b, c)))
48
>>> min(itertools.product(a, b, c),
...     key=lambda t: sum(abs(n-m) for n, m in itertools.combinations(t, 2)))
(22, 23, 24)

Depending on the size of your problem, this might be much too slow, but maybe it's sufficient.

the nr of list is maximum a 100 and the nr of points within each list could be variable but not more than 35-40 (while the average should be below 10) — D.A., Jul 03 '18 at 07:34
@D.A. Okay, with 100 lists with 10 elements each, you have 10^100 combinations, so this won't work. Still, this might help others with having a similar but smaller problem. — tobias_k, Jul 03 '18 at 08:19

Algorithm, closest point between list elements

4 Answers4