
I have two arrays, A and B, which contain instance IDs from DATA. These IDs are then used as indices into another array called Distance.

I need a fast way to:

  1. find every combination of points between A and B,
  2. look up the distance for each combination in Distance.

For example:

DATA = [0,1,...100]
A = [0,1,2]
B = [6,7,8]
Distance = [100x100] # contains the pairwise distance of all instances from DATA

# need a function to combine A and B
points_combination=[[0,6],[0,7],[0,8],[1,6],[1,7],[1,8],[2,6],[2,7],[2,8]]

# need a function to look up points_combination in Distance, so that I can get these results
distance_points=[0.346, 0.270, 0.314, 0.339, 0.241, 0.283, 0.304, 0.294, 0.254]

I have already tried to solve it myself, but it is very slow on large data.

Here's the code I tried:

import numpy as np
def function(pair_distances, k, clusters):
    list_distance = []
    cluster_qty = k

    for cluster_id in range(cluster_qty):
        all_clusters = clusters[:]                 # List of all instances ID on their own cluster
        in_cluster = all_clusters.pop(cluster_id)  # List of instances ID inside the cluster
        not_in_cluster = all_clusters              # List of instances ID outside the cluster
        # combine A and B array into a points to refer to Distance array
        list_dist_id = np.array(np.meshgrid(in_cluster, np.concatenate(not_in_cluster))).T.reshape(-1, 2)

        temp_dist = 9999999
        for instance in range(len(list_dist_id)):
            # basically refer the distance value from the pair_distances array
            temp_dist = min(temp_dist, (pair_distances[list_dist_id[instance][0], list_dist_id[instance][1]])) 
        list_distance.append(temp_dist)
    return list_distance

Note that the nested loop is the source of the slowdown. This is my first time asking on this forum, so please let me know if you need more information.

Kremi Jowo

1 Answer


The first part (points_combination) is already covered extensively in this post:

Cartesian product of x and y array points into single array of 2D points
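As a rough illustration of that idea, here is a minimal sketch that builds the pairs with np.repeat and np.tile. The values of A and B are taken from the question; this is only one of several possible ways to form the product.

```python
import numpy as np

A = np.array([0, 1, 2])
B = np.array([6, 7, 8])

# Repeat each element of A len(B) times and tile B len(A) times,
# then stack the two columns so that every (a, b) pair appears once.
points_combination = np.column_stack(
    (np.repeat(A, len(B)), np.tile(B, len(A)))
)

print(points_combination.tolist())
# [[0, 6], [0, 7], [0, 8], [1, 6], [1, 7], [1, 8], [2, 6], [2, 7], [2, 8]]
```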

The second part (distance_points): the algorithm linking points_combination to distance_points does not seem to be provided. It would be helpful if you could provide a small sample data set showing how to get from the data to your distance_points.
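If, as the example in the question suggests, distance_points is simply Distance indexed by each (a, b) pair, then the lookup might be sketched as below. The random matrix is a hypothetical stand-in for the real Distance array, so the values will not match the ones in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real 100x100 pairwise distance matrix.
Distance = rng.random((100, 100))

A = np.array([0, 1, 2])
B = np.array([6, 7, 8])

# np.ix_ builds an open mesh, so this selects the len(A) x len(B) block
# whose entry [i, j] is Distance[A[i], B[j]].
block = Distance[np.ix_(A, B)]

# Flatten in row-major order: (0,6), (0,7), (0,8), (1,6), ...
distance_points = block.ravel()
```

With this approach the explicit list of pairs is never materialized, which may matter when A and B are large.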

belamy
  • points_combination contains a set of points (a, b); each pair is then looked up in the Distance array, which contains the distances between all points in the dataset, to get distance_points. – Kremi Jowo Feb 23 '22 at 01:48
  • It is obvious that the bottleneck depends on how large cluster_qty and list_dist_id are. Depending on how many times you loop, it eats up time on every iteration that runs list_dist_id = np.array(np.meshgrid(in_cluster, np.concatenate(not_in_cluster))).T.reshape(-1, 2) – belamy Feb 23 '22 at 12:25
  • Yep, that's why if anyone has an idea for converting that loop into something faster, it would be very helpful. – Kremi Jowo Feb 25 '22 at 23:32
  • It is up to your software design and algorithm. If the best algorithm for your design is to loop through cluster_qty (say, 1000 times) and loop through list_dist_id (say, 1000 times), and the way to find list_dist_id is np.array(np.meshgrid ... ), then this is the Python code for that algorithm. – belamy Feb 26 '22 at 00:10
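For the loop discussed in these comments, one possible way to drop the inner Python loop is to index the distance matrix with np.ix_ and take the minimum of the resulting block. This is only a sketch: the function name min_distance_to_other_clusters is hypothetical, and it assumes that clusters is a list of non-empty ID lists, that there are at least two clusters, and that k equals len(clusters) (so k is omitted).

```python
import numpy as np

def min_distance_to_other_clusters(pair_distances, clusters):
    """For each cluster, return the smallest distance from any of its
    members to any member of the other clusters."""
    pair_distances = np.asarray(pair_distances)
    clusters = [np.asarray(c) for c in clusters]
    list_distance = []
    for i, in_cluster in enumerate(clusters):
        # IDs of every instance that belongs to a different cluster.
        not_in_cluster = np.concatenate(clusters[:i] + clusters[i + 1:])
        # Select the (in_cluster x not_in_cluster) block in one indexing
        # step and reduce it with a vectorized min.
        block = pair_distances[np.ix_(in_cluster, not_in_cluster)]
        list_distance.append(block.min())
    return list_distance
```

The outer loop over clusters remains, but the per-pair Python loop and the meshgrid pair array are replaced by a single block selection and reduction per cluster.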