I have a NxM
matrix and I want to compute the NxN
matrix of Euclidean distances between the M
points. In my problem, N
is about 100,000. As I plan to use this matrix for a k-nearest neighbor algorithm, I only need to keep the k
smallest distances, so the resulting NxN
matrix is very sparse. This is in contrast to what comes out of dist()
, for example, which would result in a dense matrix (and probably storage problems for my size N
).
The packages for kNN that I've found so far (knnflex
, kknn
, etc) all appear to use dense matrices. Also, the Matrix
package does not offer a pairwise distance function.
Closer to my goal, I see that the spam
package has a nearest.dist()
function that allows one to only consider distances less than some threshold, delta
. In my case, however, a particular value of delta
may produce too many distances (so that I have to store the NxN
matrix densely) or too few distances (so that I can't use kNN).
I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics
packages, but it doesn't seem like I can leverage these methods in this case.
Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for
loops and save results in a Matrix
object.