1

I have a list of N 2-d vectors and want to find out which are the k (=e.g.3) ones which appear the most often.

Vectors which difference (e.g. distance, or which would be the best "similarity measure"?) is less than a threshold th should be counted as the same. All similar vectors can be aggregated by their mean.

So my desired output would be dictionary of k vectors with their respective frequency f.

Minimal Example:

k = 1
input = [[1.0,2.0],[1.1,2.1],[3.0,4.0]]
output = {[1.05,2.05]:2}

What would be the most efficient algorithm to calculate that (pseudocode or python would be nice).

Edit: Vectors that are identical but with opposing directions (e.g. (1,-1) and (-1,1) ) should be counted as same;

gustavz
  • 2,964
  • 3
  • 25
  • 47
  • 2
    Please share a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – yatu Feb 08 '20 at 18:20
  • Do you want "sameness" to be transitive? I.e., should the list of vector be split into disjoint groups, or can a vector be counted as being "the same" as several different vectors which are not "the same" as each other? Do you need the final dictionary to contain vectors that are in the initial list? – Brian61354270 Feb 08 '20 at 18:21
  • How would you want the input `[10,10],[-10,-10], [12,12], [-12,-12]` to be treated? What about `[[-10,0], [-9,0],...,[-1,0], [0,0], [1,0],...,[9,0], [10,0]]`? – Brian61354270 Feb 08 '20 at 18:50
  • What is the issue, exactly? Have you tried anything, done any research? – AMC Feb 08 '20 at 21:03

2 Answers2

0

Calculate the similarity measure for each vector (i.e. distance d=sqrt(x^2+y^2)) and then sort the list of vectors w.r.t. the list of similarity measures. Sorting a list w.r.t. another list is described in Sorting list based on values from another list?

If you do not want to sort in https://www.geeksforgeeks.org/count-frequencies-elements-array-o1-extra-space-time/ is a O(n) algorithm for frequency counts or use hashing https://www.geeksforgeeks.org/counting-frequencies-of-array-elements/ (also O(n) time complexity )

ralf htp
  • 9,149
  • 4
  • 22
  • 34
  • This does not answer the question. The approach you describe would find the elements in the list which are most "similar" to a given vector, but does not help with extracting the "modes" of the list. – Brian61354270 Feb 08 '20 at 18:26
  • finding the modes or frequencies in a sorted array is extremely easy (https://www.geeksforgeeks.org/count-number-of-occurrences-or-frequency-in-a-sorted-array/) – ralf htp Feb 08 '20 at 18:28
  • The method of sorting itself is nonsensical. What exactly is the "similarity measure" that you are sorting against? The OP is looking to extract "clusters" of nearby vectors from a data set. – Brian61354270 Feb 08 '20 at 18:32
  • The hash algorithm described in https://www.geeksforgeeks.org/counting-frequencies-of-array-elements/ is a good starting point. But in my case not only identical but also similar elements should be combined and counted as the same. – gustavz Feb 08 '20 at 18:32
  • @Brian a similarity measure is i.e. the length of the vector, this is in the question – ralf htp Feb 08 '20 at 18:33
  • @ralfhtp Please read my previous two comments. This approach simply does not apply to the question. Based on your rough description, I assume the "similarity measure" you're proposing is computing the mean "distance" of each vector in the input set from every other vector. This measure provides no information about each vectors relative proximity to its immediate neighbors. Consider a set consisting of two disjoint discs of vectors. How would this method correctly isolate the two sets of "similar" vectors? – Brian61354270 Feb 08 '20 at 18:40
  • 1
    @gustavz You could possibly round (ceil or floor) the values of the similarity measure ? Or choose the next lower even number as category – ralf htp Feb 08 '20 at 18:41
  • @Brian No the similarity measure is i.e. the vector length / vector magnitude (https://onlinemschool.com/math/library/vector/length/). It is calculated to have a single value and not an x,y pair for comparison – ralf htp Feb 08 '20 at 18:43
  • @ralfhtp Two vectors having similar norms does not indicate that they are near to one another. Consider the set {(-100, -100), (100, 100), (90, 90)}. I highly doubt the OP wants (-100, -100) and (100, 100) to be considered similar, and (100, 100) and (90, 90) to be considered dissimilar. – Brian61354270 Feb 08 '20 at 18:47
  • Choosing the similarity measure is another point that is not solved yet, maybe you have a solution – ralf htp Feb 08 '20 at 18:48
  • @ralfhtp It is not solved yet because it is not mathematically feasible. Please consider the examples I have already provided. – Brian61354270 Feb 08 '20 at 18:52
  • 1
    It is mathematically feasible the question is how efficient it can be done,i.e. in another pass determining the quadrant of each vector and then count the frequency for each quadrant. – ralf htp Feb 08 '20 at 19:03
0

A bit late but I'll post my final solution as answer, maybe someone is interested:

import json

def get_most_freq_k(arr,k):  
    d = dict() 
    # Traverse through array elements  
    # and count frequencies 
    for i in range(len(arr)): 
        key = json.dumps(arr[i])
        if key in d.keys(): 
            d[key] += 1
        else: 
            d[key] = 1          
    # return dict with the k most frequent elements
    mostFreqKeys = sorted(d, key=d.get, reverse=True)[:k]
    return {k: d[k] for k in set(mostFreqKeys)}
gustavz
  • 2,964
  • 3
  • 25
  • 47