7

I have a large set of features that looks like this:

id1 28273 20866 29961 27190 31790 19714 8643 14482 5384 ....  upto 1000
id2 12343 45634 29961 27130 33790 14714 7633 15483 4484 ....  
id3 ..... ..... ..... ..... ..... ..... .... ..... .... .... .   .   .
...
id200000 .... .... ... ..  .  .  .  .

I want to compute for each id euclidean distance and sort them to find the 5-nearest points. Because my dataset is very large. what is the best way to do it.

Rafaelopasa
  • 89
  • 1
  • 1
  • 2
  • 5
    Welcome to Stack Overflow! We encourage you to [research your questions](http://stackoverflow.com/questions/how-to-ask). If you've [tried something already](http://whathaveyoutried.com/), please add it to the question - if not, research and attempt your question first, and then come back. –  Sep 11 '12 at 12:15
  • 2
    Are idn different locations (i.e. you are calculating this for a 1000 dimensional space). If so, when you say "the Euclidean distance" to which point? If it's as a group, please could you define "k-closest"... it's not obvious what you mean. – Andy Hayden Sep 11 '12 at 12:18
  • For example if I give an input as id2 to the script. I expect the result: 5-nearest points with respect to id2. I want to compute Euclidean distances from id2 to all the points in the dataset, sort them and return the 5-nearest point. – Rafaelopasa Sep 11 '12 at 14:21

2 Answers2

21

scikit-learn has nearest neighbor search. Example:

  1. Load your data into a NumPy array.

    >>> import numpy as np
    >>> X = np.array([[28273, 20866, 29961, 27190, 31790, 19714, 8643, 14482, 5384, ...],
                      [12343, 45634, 29961, 27130, 33790, 14714, 7633, 15483, 4484, ...], 
                      ...
                      ])
    

    (Just two points shown.)

  2. Fit a NearestNeighbors object.

    >>> from sklearn.neighbors import NearestNeighbors
    >>> knn = NearestNeighbors(n_neighbors=5)
    >>> knn.fit(X)
    NearestNeighbors(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
             radius=1.0, warn_on_equidistant=True)
    

    p=2 means Euclidean (L2) distance. p=1 would mean Manhattan (L1) distance.

  3. Perform queries. To get the neighbors of X[0], your first data point:

    >>> knn.kneighbors(X[0], return_distance=False)
    array([[0, 1]])
    

    So, the nearest neighbors of X[0] are X[0] itself and X[1] (of course).

Make sure you set n_neighbors=6 because every point in your set is going to be its own nearest neighbor.

Disclaimer: I'm involved in scikit-learn development, so this is not unbiased advice.

normanius
  • 8,629
  • 7
  • 53
  • 83
Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • 1
    For example if I give an input as "id2" to the script. I expect the result with 5-nearest points with respect to "id2". I want to compute Euclidean distances from "id2" to all the points in the dataset, sort them and return the 5-nearest point. Thanks for your input. I see that you separated the "id numbers" from the dataset. However, I want to keep the 'idn' together with their values in the same array. So that when I sort the 5-nearest points, I could know which ids they belong to. – Rafaelopasa Sep 11 '12 at 14:24
  • 1
    @Rafaelopasa: so? Add one to the index and paste `id` in front. Or keep an array of ids around, if they're not consecutive. – Fred Foo Sep 11 '12 at 14:44
  • 2
    What if I would like to find `NearestNeighbors` to new data point? – AAAA Sep 09 '20 at 08:45
  • 1
    Can't see a way to get the cosine similarity (i.e. dot product)? – CpILL Oct 21 '21 at 02:54
3

From your question it is not entirely clear what the specifics of your problem are. I understood so far, that you need to calculate euclidean distances between a large amount of data points. The fastest solution in Python probably makes use of the scipy.spatial.distance module. Please have a look at

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

and

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

You will have to make yourself familiar with the numpy data types, develop input data for one of these functions and further evaluate the resulting data. You'll probably end up trying to get some maximum/minimum N values of an array, at which point How to get indices of N maximum values in a numpy array? could help.

Community
  • 1
  • 1
Dr. Jan-Philip Gehrcke
  • 33,287
  • 14
  • 85
  • 130
  • 1
    Maybe aborting computation if the sum goes beyond a limit (i. e. when it is clear that the result will be larger than some other already computed) will speed up the process. Don't know whether this can be done in scipy, though. – Alfe Sep 11 '12 at 12:28
  • 1
    For example if I give an input as "id2" and the above "feature-set file.txt" to the script. I expect the result with 5-nearest points with respect to "id2". I want to compute Euclidean distances from "id2" to all the points in the dataset, sort them and return the 5-nearest point. Thanks for your input – Rafaelopasa Sep 11 '12 at 14:31