
I have a data frame with a group of x and y points. I need to calculate the Euclidean distance from every point to every other point, and then, for each row, count how many points fall within a given range.

For example, if I had this data frame:

x y
- -
1 2
2 2
9 9

I need to add a column that indicates how many other points (treating these as points in a Cartesian plane) are within a distance of 3 units:

x y n
- - -
1 2 1
2 2 1
9 9 0

Thus, the first point (1,2) has one other point, (2,2), within that range, whereas the point (9,9) has 0 points within a distance of 3 units.

I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or another library.
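For reference, the nested-loop baseline I want to avoid would look roughly like this (a sketch, assuming the data frame is called `mydataframe`):

```r
# naive O(n^2) baseline: compare every point against every other point
mydataframe <- data.frame(x = c(1, 2, 9), y = c(2, 2, 9))

n <- integer(nrow(mydataframe))
for (i in seq_len(nrow(mydataframe))) {
  for (j in seq_len(nrow(mydataframe))) {
    if (i != j) {
      d <- sqrt((mydataframe$x[i] - mydataframe$x[j])^2 +
                (mydataframe$y[i] - mydataframe$y[j])^2)
      if (d <= 3) n[i] <- n[i] + 1L
    }
  }
}
mydataframe$n <- n
mydataframe
#>   x y n
#> 1 1 2 1
#> 2 2 2 1
#> 3 9 9 0
```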

This is what I have:

count.in.range <- function (df) {
  xp <- df$x
  yp <- df$y
  return(nrow(filter(df, dist( rbind(c(x, y), c(xp, yp)) ) < 3 )))
}

ddply(.data = mydataframe, .variables = .(x, y), .fun = count.in.range)

But, for some reason, this doesn't work. I think it has to do with `filter`.
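A rowwise dplyr sketch that does seem to behave (each row's `x` and `y` are compared against the full columns, and the point's zero distance to itself is subtracted off; I haven't verified this on large data):

```r
library(dplyr)

mydataframe <- data.frame(x = c(1, 2, 9), y = c(2, 2, 9))

result <- mydataframe %>%
  rowwise() %>%
  # distance from this row's point to every point; the "- 1" excludes the point itself
  mutate(n = sum(sqrt((x - mydataframe$x)^2 + (y - mydataframe$y)^2) <= 3) - 1) %>%
  ungroup()
```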

Federico
  • I would suggest you work with pairs of points in the long format and then use a `data.table` solution, which is probably one of the fastest and memory-efficient alternatives to work with large datasets. [There is a really fast solution to a very similar problem, here](http://stackoverflow.com/questions/36817423/how-to-efficiently-calculate-distance-between-pair-of-coordinates-using-data-tab) – rafa.pereira May 31 '16 at 20:47
  • @Frederico, have you had the chance to test my answer below ? – rafa.pereira Jun 04 '16 at 22:41

3 Answers


Given

df_ <- data.frame(x = c(1, 2, 9),
                  y = c(2, 2, 9))

You can use the function "dist":

matrix_dist <- as.matrix(dist(df_))
# subtract 1 so each point's zero distance to itself is not counted
df_$n <- rowSums(matrix_dist <= 3) - 1
df_$n
#[1] 1 1 0
Michele Usuelli
  • What if I have thousands of rows? I have 64 GB of RAM and it's still not enough to compute that matrix. I updated my question with a possible answer. What do you think? – Federico May 27 '16 at 23:37

This is a base approach with a straightforward application of a "distance function", but only on a row-by-row basis, so the full distance matrix is never held in memory:

# compare squared distances to 3^2 = 9; the "- 1" excludes the point itself
apply( df_ , 1, function(x) sum( (x[1] - df_[['x']])^2 + (x[2] - df_[['y']])^2 <= 9 ) - 1 )
#[1] 1 1 0

It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
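The sweep-based phrasing of the same row-by-row count might look like this (a sketch using the same `df_` as above; `sweep` subtracts the current point from every row):

```r
df_ <- data.frame(x = c(1, 2, 9), y = c(2, 2, 9))
m <- as.matrix(df_)

n <- apply(m, 1, function(p) {
  d2 <- rowSums(sweep(m, 2, p)^2)  # squared distances from p to every point
  sum(d2 <= 9) - 1                 # exclude p itself
})
```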

IRTFM

I would suggest you work with pairs of points in the long format and then use a data.table solution, which is probably one of the fastest and most memory-efficient alternatives for working with large datasets.

library(data.table)
library(reshape)

df <- data.frame(x = c(1, 2, 9),
                 y = c(2, 2, 9))

The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:

df_long <- expand.grid.df(df, df)

# convert to data.table and rename columns
setDT(df_long)
setnames(df_long, c("x", "y", "x1", "y1"))

Now you only need to do this:

# calculate the distance between each pair of points
df_long[, mydist := dist(matrix(c(x, x1, y, y1), ncol = 2, nrow = 2)), by = .(x, y, x1, y1)]

# count how many other points are within a distance of 3 units
# (mydist > 0 drops each point's zero-distance pairing with itself)
df_long[mydist > 0 & mydist <= 3, .(count = .N), by = .(x, y)]

#>    x y count
#> 1: 1 2     1
#> 2: 2 2     1
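If the reshape dependency is unwanted, the same long table can be built with data.table's `CJ` on row indices, and the per-group `dist()` call replaced by a plain vectorized expression (a sketch; the `a`/`b` index column names are my own):

```r
library(data.table)

df <- data.frame(x = c(1, 2, 9), y = c(2, 2, 9))

# all ordered pairs of row indices, then look up the coordinates
idx <- CJ(a = seq_len(nrow(df)), b = seq_len(nrow(df)))
df_long <- idx[, .(x = df$x[a], y = df$y[a], x1 = df$x[b], y1 = df$y[b])]

# vectorized distance: no per-group dist() call needed
df_long[, mydist := sqrt((x - x1)^2 + (y - y1)^2)]

# count other points within 3 units of each point
df_long[, .(count = sum(mydist > 0 & mydist <= 3)), by = .(x, y)]
```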
rafa.pereira