Looping and finding nearest data points for sub-groups

Question

I'm trying to replicate the work done in the really cool nearest neighbor question but do it for each area in my dataframe rather than the group as a whole.

My data ncbaby (don't ask) looks like this:

     id      printid     areaname latitude longitude
1  7912048 233502729     073    36.06241 -80.44229
2   735253 171241999 Area 12-06 35.54452 -78.75388
3  4325564  85564887 Area 12-04 35.49328 -78.73756
4  4222241  85461255 Area 12-06 35.53621 -78.75553
5 11997754 356053648 Area 12-04 35.49328 -78.73756
6 13444458 536073775 Area 12-06 35.53987 -78.74922

I'd like to run the function for each areaname. I tried calling split but but the distance function won't call on a list.

splitfile <- split(ncbaby, ncbaby$precinctname)

c <- gDistance(splitfile, byid=TRUE)

Error in (function (classes, fdef, mtable)  : 
unable to find an inherited method for function ‘is.projected’ for signature ‘"list"’

The closest I've gotten is:

library(sp)
library(rgeos)

uniq <- unique(unlist(ncbaby$areaname))
for (i in 1:length(uniq)){
    data_1 <- subset(ncbaby, areaname == uniq[i])
    sp.mydata <- data_1
    coordinates(sp.mydata) <- ~longitude+latitude
    d <- gDistance(sp.mydata, byid=TRUE)
    min.d <- apply(d, 1, function(x) order(x, decreasing=F)[2])
    newdata <- cbind(data_1, data_1[min.d,], apply(d, 1, function(x) sort(x, decreasing=F)[2]))
    colnames(newdata) <- c(colnames(data_1), 'n.ID',
                          'n.printid', '.Areaname', 'n.latitude', 'n.longitude',
                          'distance')
}

The problem here is that it ends up kicking out only the last returned value. ideas? I'd be interested/open to modifying the function as well, as it seems it might be more efficient.

Please specify the libraries used. gDistance is not a part of base R. — Dave2e, Aug 31 '16 at 23:32
added, sorry for the omission, thanks for the editing help/advice. — ike, Aug 31 '16 at 23:47
I'm confused. You say *function wouldn't call on a list* but then say *problem here is that it ends up kicking out only the last returned value*. So which is it? Does function not work? Or does it yield undesired results? Immediately, it can be seen your `for` loop overwrites dataframe each time. — Parfait, Sep 01 '16 at 00:21
thank you @Parfait for the question and pointing out the for loop item. I added in the function call that failed. Do you have any guidance on making the for loop append instead of overwrite? — ike, Sep 01 '16 at 00:29
You need to use lapply or tapply function instead of the loop. have a look at this one http://stackoverflow.com/questions/19261159/r-applying-a-function-to-a-subset-of-a-data-frame — MFR, Sep 01 '16 at 00:36

jazzurro · Answer 1 · 2016-09-01T03:04:14.447

I checked the linked post and revised the idea a bit. I thought using apply() may not be a good idea for a large data set. So I rather used a data.table-related approach. First, I converted my sample data to a SpatialPointsDataFrame. Then, I split the data by a group variable (i.e.,group). As Eddie suggested, I utilized lapply() with data.table functions. When you use gDistance(), you have a two-dimensional vector as an output. I converted that to a data.table object so that following data process potentially goes faster. I reshaped the dt object with melt() and remove all data points with distance = 0. Finally, I took the first row for each Var1. Please note that Var1 here represents each row of the sample data, mydf. The last job was to add the new distance vector to the original data frame. I hope this will help you.

DATA

   group user_id  latitude longitude
1    B23   85553 -34.44011  172.6954
2    B23   85553 -34.43929  172.6939
3    B23   85553 -34.43929  172.6939
4    B23   85553 -34.43851  172.6924
5    B23   57357 -34.42747  172.6778
6    B23   57357 -34.42747  172.6778
7    B23   57357 -34.42747  172.6778
8    B23   98418 -34.43119  172.7014
9    B23   98418 -34.43225  172.7023
10   B23   98418 -34.43224  172.7023
11   B23   98418 -34.43224  172.7023
12   B24   57357 -34.43647  172.7141
13   B24   57357 -34.43647  172.7141
14   B24   57357 -34.43647  172.7141
15   B24   98418 -34.43904  172.7172
16   B24   98418 -34.43904  172.7172
17   B24   98418 -34.43904  172.7172
18   B24   98418 -34.43925  172.7168
19   B24   98418 -34.43915  172.7169
20   B24   98418 -34.43915  172.7169
21   B24   98418 -34.43915  172.7169
22   B24   98418 -34.43915  172.7169

CODE

library(sp)
library(rgeos)
library(data.table)

# Copy the original
temp <- mydf

#DF to SPDF
coordinates(temp) <- ~longitude+latitude

# Split the data by a group variable
mylist <- split(temp, f = temp$group)

#For each element in mylist, apply gDistance, reshape the output of
# gDistance and create a data.table. Then, reshape the data, remove
# rows with distance = 0. Finally, choose the first row for each 
# variable. levels in variable represents rows in mydf.

out <- rbindlist(
        lapply(mylist, function(x){

           d <- setDT(melt(gDistance(x, byid = TRUE)))
           setorder(d, Var1, value)
           d <- d[value > 0]
           d <- d[, .SD[1], by = Var1]
           d 

        })
    )

out <- cbind(mydf, distance = out$value)

#   group user_id  latitude longitude     distance
#1    B23   85553 -34.44011  172.6954 1.743945e-03
#2    B23   85553 -34.43929  172.6939 1.661118e-03
#3    B23   85553 -34.43929  172.6939 1.661118e-03
#4    B23   85553 -34.43851  172.6924 1.661118e-03
#5    B23   57357 -34.42747  172.6778 1.836642e-02
#6    B23   57357 -34.42747  172.6778 1.836642e-02
#7    B23   57357 -34.42747  172.6778 1.836642e-02
#8    B23   98418 -34.43119  172.7014 1.369100e-03
#9    B23   98418 -34.43225  172.7023 1.456022e-05
#10   B23   98418 -34.43224  172.7023 1.456022e-05
#11   B23   98418 -34.43224  172.7023 1.456022e-05
#12   B24   57357 -34.43647  172.7141 3.862696e-03
#13   B24   57357 -34.43647  172.7141 3.862696e-03
#14   B24   57357 -34.43647  172.7141 3.862696e-03
#15   B24   98418 -34.43904  172.7172 3.245705e-04
#16   B24   98418 -34.43904  172.7172 3.245705e-04
#17   B24   98418 -34.43904  172.7172 3.245705e-04
#18   B24   98418 -34.43925  172.7168 1.393162e-04
#19   B24   98418 -34.43915  172.7169 1.393162e-04
#20   B24   98418 -34.43915  172.7169 1.393162e-04
#21   B24   98418 -34.43915  172.7169 1.393162e-04
#22   B24   98418 -34.43915  172.7169 1.393162e-04

DATA in dput()

mydf <- structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("B23", 
"B24"), class = "factor"), user_id = c(85553L, 85553L, 85553L, 
85553L, 57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L, 
57357L, 57357L, 57357L, 98418L, 98418L, 98418L, 98418L, 98418L, 
98418L, 98418L, 98418L), latitude = c(-34.440114, -34.43929, 
-34.43929, -34.438507, -34.427467, -34.427467, -34.427467, -34.431187, 
-34.432254, -34.43224, -34.43224, -34.436472, -34.436472, -34.436472, 
-34.439038, -34.439038, -34.439038, -34.439246, -34.439149, -34.439149, 
-34.439149, -34.439149), longitude = c(172.695443, 172.693906, 
172.693906, 172.692441, 172.677763, 172.677763, 172.677763, 172.701413, 
172.702284, 172.702288, 172.702288, 172.71411, 172.71411, 172.71411, 
172.717203, 172.717203, 172.717203, 172.716798, 172.716898, 172.716898, 
172.716898, 172.716898)), .Names = c("group", "user_id", "latitude", 
"longitude"), row.names = c(NA, -22L), class = "data.frame")

You say that gDistance is a two-dim vector, aka a matrix (?), and there is a method for melting matrices directly in reshape2. — Frank, Sep 01 '16 at 02:41
@Frank Do you mean `melt(gDistance(x, byid = TRUE)` is enough then? — jazzurro, Sep 01 '16 at 02:47
Yeah, I think so, but it will only work if reshape2 is installed (from what I gather from the source code of `melt` when data.table is loaded). — Frank, Sep 01 '16 at 02:52
@Frank Ultimately, `d <- setDT(melt(gDistance(x, byid = TRUE))` then, if we insist using data.table? — jazzurro, Sep 01 '16 at 02:56
learned some new packages from your post. Good one. Pluse one. — akrun, Sep 01 '16 at 04:04
@akrun I am glad to hear that and thanks for your encouragement. — jazzurro, Sep 01 '16 at 04:05

aichao · Answer 2 · 2016-09-01T05:12:14.290

Here is a solution that adapts the solution in the linked post to separate groups by area. First, we define two functions:

library(sp)
library(rgeos)

nearest.neighbor <- function(lon,lat) {
  df <- data.frame(lon,lat)
  coordinates(df) <- ~lon+lat
  d <- gDistance(df, byid=TRUE)
  # remove the self distance from being considered and use which.min to find the nearest neighbor
  d[cbind(1:nrow(d),1:nrow(d))] <- NA
  min.d <- rbind(apply(d,1,function(x) {ind <- which.min(x); list(ind=ind,distance=x[ind])}))
}

order.by.ind <- function (x,ind) x[ind]

The nearest.neighbor function follows closely to the code in the linked post except that it returns a vector of lists. Each list contains the index to the nearest neighbor and the distance to that neighbor. The key here is that we only want to compute the distance once to return both the min distance and the corresponding index. Note that we remove the self-distance from consideration by setting the diagonal of d to NA and then use which.min to locate the nearest neighbor, which avoids having to do a full sort.

The order.by.ind function simply reorders the input column x according to the indices ind.

With these two functions, we can use mutate from the dplyr package to compute the desired columns grouped by areaname:

library(dplyr)

result <- ncbaby %>% group_by(areaname) %>%
                     mutate(min.d=nearest.neighbor(longitude, latitude)) %>%
                     mutate_each(vars=c(id, printid, latitude, longitude),
                                 funs(order.by.ind, "order.by.ind", order.by.ind(.,ind=unlist(min.d)[c(TRUE,FALSE)]))) %>%
                     mutate(distance=unlist(min.d)[c(FALSE,TRUE)]) %>%
                     mutate(.Areaname=areaname) %>%
                     select(-min.d)

newvars <- c('n.ID', 'n.printid', 'n.latitude', 'n.longitude', 'distance', '.Areaname')
colnames(result) <- c(colnames(ncbaby), newvars)

Notes:

The first mutate creates a temporary column min.d containing the list of ind and distance to the nearest neighbor. This is the nearest neighbor in the area since we are group_by areaname.
The second mutate_each creates a new column for each variable in vars by reordering that column in accordance to the ind. Note that we extract the ind from min.d by unlisting and then extracting the odd elements using [c(TRUE,FALSE)].
The third mutate creates the distance column by extracting the distance from min.d. Again, this is by unlisting but then extracting the even elements using [c(FALSE,TRUE)].
The fourth mutate is actually not needed because the .Areaname column will be redundant to areaname in the result.
Finally, we remove the min.d column from the result and set the column names of the resulting data frame as desired.

The result using your data is:

print(result)
##Source: local data frame [7 x 11]
##Groups: areaname [3]
##
##        id   printid   areaname latitude longitude     n.ID n.printid n.latitude n.longitude    distance  .Areaname
##     <int>     <int>     <fctr>    <dbl>     <dbl>    <int>     <int>      <dbl>       <dbl>       <dbl>     <fctr>
##1  7912048 233502729        073 36.06241 -80.44229  7912049 233502730   36.06251   -80.44329 0.001004988        073
##2   735253 171241999 Area 12-06 35.54452 -78.75388 13444458 536073775   35.53987   -78.74922 0.006583168 Area 12-06
##3  4325564  85564887 Area 12-04 35.49328 -78.73756 11997754 356053648   35.49328   -78.73756 0.000000000 Area 12-04
##4  4222241  85461255 Area 12-06 35.53621 -78.75553 13444458 536073775   35.53987   -78.74922 0.007294635 Area 12-06
##5 11997754 356053648 Area 12-04 35.49328 -78.73756  4325564  85564887   35.49328   -78.73756 0.000000000 Area 12-04
##6 13444458 536073775 Area 12-06 35.53987 -78.74922   735253 171241999   35.54452   -78.75388 0.006583168 Area 12-06
##7  7912049 233502730        073 36.06251 -80.44329  7912048 233502729   36.06241   -80.44229 0.001004988        073

where I added a new row for areaname="073" so that each area has at least two rows.

Looping and finding nearest data points for sub-groups

2 Answers2