
First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10), and for each point i in the first set I need to find the point j in the second set at minimum distance.

My idea is simple: for each i, let dist[j] be the distance between point i and center j; then I only need which.min(dist) to find what I am looking for.

Each point is an array of 64 doubles, so

> dim(data)
[1] 3823   64
> dim(centers)
[1] 10 64

I have tried with

d <- numeric(K)  # distances from point i to each center
S <- integer(n)  # index of the nearest center for each point
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}

which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is

# Builds a matrix whose first row is `point` and whose remaining rows are
# `group`, then keeps only the point-to-row distances from dist()'s output.
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i,], centers)
  S[i] <- which.min(d)
}

Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the first one (can anyone explain why?), but it is still not fast enough for what I need, because it will not be used only once. And the distance code is also very ugly. I tried to replace it with

distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}

but this seems to be twice as slow. I also tried to use dist for each pair, but that is also slower.

I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?

PS: I need this to implement k-means by hand (and I do need to do it; it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does all the computation in less than one second.

dbarbosa
  • People 'round here kind-a-don't-like-doing assignments... so try to focus on a specific problem. – aL3xa Jun 12 '10 at 19:38

5 Answers


Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.

# Generate some fake data. Note that points are stored as *columns* here
# (the transpose of the question's layout), so the recycled subtraction
# x - center works point-by-point.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)  # squared distance of every point to this center
  })
)

Runs in:

   user  system elapsed 
  0.100   0.008   0.108 

on my laptop.
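
Note that dists here is an n x K matrix of *squared* Euclidean distances; since sqrt is monotone, that is already enough to pick the nearest center. A minimal follow-up sketch, reusing dists from above:

# Nearest center for each of the n points; max.col of the negated
# matrix is a fast row-wise which.min (skipping sqrt changes nothing).
S <- max.col(-dists)
# equivalent but slower: S <- apply(dists, 1, which.min)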

Jonathan Chang
  • +1, beats my way of calculating the dists matrix. This is a nice trick, with the vector automatically recycled when added to or subtracted from a matrix. – Marek Jun 12 '10 at 23:00
  • I am trying to use your solution, but your matrices are transposed. Is there a way to subtract rows like you did with columns? – dbarbosa Jun 12 '10 at 23:12
  • I tried the row-wise subtraction using apply, but it was not as fast as your solution. I am now transposing the matrix and using your code, and it is really fast! Many thanks!!! Also, thank you for your complete answer with a small example and the use of system.time. Thank you very much :) – dbarbosa Jun 12 '10 at 23:35

rdist() is an R function from the {fields} package that can quickly calculate the distances between two sets of points in matrix format.

https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html

Usage:

library(fields)
# generating fake data
n <- 10  # rows (points) in x
m <- 5   # rows (points) in y
d <- 3

x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)

rdist(x, y)
          [,1]     [,2]      [,3]     [,4]     [,5]
 [1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
 [2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
 [3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
 [4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
 [5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
 [6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
 [7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
 [8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
 [9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
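
For the question's nearest-center task, an illustrative sketch (assuming data and centers are the question's matrices, one point per row):

library(fields)
D <- rdist(data, centers)    # n x K matrix of Euclidean distances
S <- apply(D, 1, which.min)  # index of the nearest center per point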
Deuterium

You may want to have a look at the apply family of functions.

For instance, this code

for (j in 1:K) {
  d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}

can easily be replaced with something like

dt <- data[i,]
# one apply call computes the distance from point i to every center row;
# note the parentheses around (x - dt) before squaring
d <- apply(centers, 1, function(x) { sqrt(sum((x - dt)^2)) })

You can definitely optimise it more, but you get the point, I hope.
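
For completeness, a sketch of the full loop under the question's setup (its data, centers, n, and S):

S <- integer(n)
for (i in 1:n) {
  dt <- data[i,]
  d <- apply(centers, 1, function(x) sqrt(sum((x - dt)^2)))
  S[i] <- which.min(d)
}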

nico
  • Thanks... It is faster than the first code that I wrote, but not even close to the strange one using `distance`. – dbarbosa Jun 12 '10 at 19:19
  • @dbarbosa: well, apparently the `stats::kmeans` function uses compiled code, which is obviously faster. Just type `kmeans` and you'll see the source code for it. :) – nico Jun 12 '10 at 20:58

dist is fast because it's vectorized and calls internal C functions.
Your code in the loop could be vectorized in many ways.

For example, to compute the distances between data and centers you could use outer:

# Note: outer calls diff_ij once with whole index vectors (not once per
# pair), so the function must be vectorized -- which rowSums makes it.
diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)

This gives you an n x K matrix of distances, and it should be much faster than the loop.

Then you can use max.col to find the maximum in each row (see the help page; there are some nuances when there are several maxima). X must be negated because we are searching for the minimum.

CL <- max.col(-X)

To be efficient in R you should vectorize as much as possible. Loops can in many cases be replaced by a vectorized substitute. Check the help for rowSums (whose help page also describes rowMeans, colSums, and colMeans), pmax, and cumsum. You can also search SO for examples, e.g. https://stackoverflow.com/search?q=[r]+avoid+loop. The loop-collapsing idea is sketched below.
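
A minimal sketch of that idea, assuming the question's data and centers (one point per row): the inner K-loop collapses to one vectorized line.

point <- data[1, ]  # one query point
d <- sqrt(rowSums(sweep(centers, 2, point)^2))  # distance to every center at once
which.min(d)        # index of the nearest center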

Marek
  • Hi, I am trying to use your code but it is not working. I tried to use it with the same code that @Jonathan Chang wrote, adding: `system.time(outer(seq_len(n), seq_len(K), function(i,j) sqrt(rowSums((x[,i]-centers[,j])^2))))`, but I am getting this error: `Error in dim(robj) <- c(dX, dY) : dims [product 38230] do not match the length of object [64]` Do you see what is wrong? – dbarbosa Jun 12 '10 at 22:46
  • Actually I did not understand `outer` (I thought it called the function once for each pair). Now I understand it; thank you, it can be useful! And also, thanks for telling me about `max.col`. – dbarbosa Jun 12 '10 at 23:53

My solution:

# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
}

You can try it like this:

x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
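
Applied to the question's problem, a sketch (assuming the question's data and centers matrices):

# for each data row, distances to all centers, then the argmin
S <- apply(data, 1, function(p) which.min(euc.dist(centers, p)))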
Adriano Rivolli