R function to calculate nearest neighbor distance given [inconsistent] constraint?

Question

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.

I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.

Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?

[An even more complex set of constraints would be even more helpful...]

For my example: specifying that a tree must also be in the same plot as the tree of interest or is the same species as the tree of interest

so you want to only consider nearest neighbors above some height? or is it relative to the height of the tree you're considering? — Shape, Oct 10 '16 at 17:04
@Shape I want it to be relative to the height of the tree of interest. So the constraint cannot be a hard-and-fast rule (e.g., > 10 m), but instead must change with each given point (tree). — theforestecologist, Oct 10 '16 at 17:06

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

I'd do it with non-equijoins and data.table

EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)

EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own threshholds. I'll fix that in future, but this works for now.

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1), 
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]

The final dataset contains each pair, according to the given threshholds

EDIT:

With Additional variables:

If you want to join on additional parameters, this allows you to do it, (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller)

Here's an example joining on two additional categorical variables, species and plot:

 library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2, 
                            height <= thresh1,
                            species == species,
                            plot == plot),
              nomatch = NA,
              allow.cartesian = TRUE]

# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]

# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final

> final
      id         x         y    height species plot  height.1 i.id       i.x       i.y  i.height        dist
  1:   3 0.4837348 0.4325731  91.53387       C    2 111.53387  486 0.5549221 0.4395687 101.53387 0.005116568
  2:  13 0.8267298 0.3137061  94.58949       C    2 114.58949  754 0.8408547 0.2305702 104.58949 0.007111079
  3:  29 0.2905729 0.4952757  89.52128       C    2 109.52128  333 0.2536760 0.5707272  99.52128 0.007054301
  4:  37 0.4534841 0.5249862  89.95493       C    2 109.95493   72 0.4807242 0.6056771  99.95493 0.007253044
  5:  63 0.1678515 0.8814829  84.77450       C    2 104.77450  289 0.1151764 0.9728488  94.77450 0.011122404
 ---                                                                                                        
994: 137 0.8696393 0.2226888  66.57792       C    2  86.57792  473 0.4467795 0.6881008  76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466       A    2 130.14466  338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821       A    2 133.61821  348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652  71.39953       C    3  91.39953  904 0.6954973 0.3690599  81.39953 0.112536771
998: 172 0.6834473 0.6221259  65.52187       A    2  85.52187  783 0.4400028 0.9526355  75.52187 0.168501816
>

NOTE: in the final answer, there are columns height and height.1, the latter appears to result from data.table's equi join and represent the upper and lower boundary respectively.

A Mem-efficient solution

One of the issues here for @theforestecologist was that this requires a lot of memory to do,

(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused mem issues),

However, we can do this in a more memory efficient way by using .EACHI (I believe). Since we will not load the full table into memory. That solution follows:

library(data.table)
dtree <- data.table(id = 1:1000,
                    x = runif(1000), 
                    y = runif(1000), 
                    height = rnorm(1000,mean = 100,sd = 10),
                    species = sample(LETTERS[1:3],1000,replace = TRUE),
                    plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]

# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]


# for clarity inside the brackets, 
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2 

# Join on a range, must be a cartesian join, since there are many candidates 
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2, 
                                 height <= thresh1,
                                 species,
                                 plot),
              .(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
              by = .EACHI,
              nomatch = NA,
              allow.cartesian = TRUE]

# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]

> test
        id          x          y    height species plot i.id        i.x        i.y  i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
   1:    1 0.17622235 0.66547312  84.68450       B    2  965 0.17410840 0.63219350  93.60226         B      2   74.68450   94.68450           B        2
   2:    2 0.04523011 0.33813054  89.46288       B    2  457 0.07267547 0.35725229  88.42827         B      2   79.46288   99.46288           B        2
   3:    3 0.24096368 0.32649256 103.85870       C    3  202 0.20782303 0.38422814  94.35898         C      3   93.85870  113.85870           C        3
   4:    4 0.53160655 0.06636979 101.50614       B    1  248 0.47382417 0.01535036 103.74101         B      1   91.50614  111.50614           B        1
   5:    5 0.83426727 0.55380451 101.93408       C    3  861 0.78210747 0.52812487  96.71422         C      3   91.93408  111.93408           C        3

This way we should keep total memory usage low.

This is awesome!! I'm about to test it out right now! ...but before I do...you mention speed. Do you think doing the above for 200k+ points would take a ridiculously long time based on your sample set? — theforestecologist, Oct 10 '16 at 17:39
@theforestecologist It's not quite right yet, I think the self-join grabbed the wrong threshholds, fixing — Shape, Oct 10 '16 at 17:43
@theforestecologist for 200k points, it really depends on how many other variables you're joining on. If you expect, by the criteria, around 30 candidates per point, we're going to end up with 6 million rows, so you would need the memory to support that data.table. But with it all in mem, it should be very fast. — Shape, Oct 10 '16 at 17:58
I can't get data.table 1.9.7 to download properly. Do you know which portion of the current CRAN version is missing/not working to allow your code to work? [FYI I've never used data/table before] — theforestecologist, Oct 10 '16 at 18:07
if you use the devtools package, `devtools::install_github('Rdatatable/data.table')` should install it — Shape, Oct 10 '16 at 18:10
@theforestecologist fixed it, I just was confused at to what height.1 was in the final answer. height and height.1 appear to be the range, and are calculated by data.table during the operation — Shape, Oct 10 '16 at 18:23
so unfortunately I ran out of memory during the `id != id` step [I get error `In [.data.table(test, id != i.id) : Reached total allocation of 8182Mb: see help(memory.size)` six times]. The object, `test`, has dimensions 25690097 x 46 at this step. Being new to data.table, I'm not sure if I could somehow subset my table to move forward?? Any recommended work arounds? — theforestecologist, Oct 10 '16 at 18:32
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/125364/discussion-between-shape-and-theforestecologist). — Shape, Oct 10 '16 at 18:33
See my answer here : http://stackoverflow.com/questions/21977720/r-finding-closest-neighboring-point-and-number-of-neighbors-within-a-given-rad/43378579#43378579 — Nico Coallier, Apr 12 '17 at 21:13

R function to calculate nearest neighbor distance given [inconsistent] constraint?

1 Answers1

EDIT:

With Additional variables:

A Mem-efficient solution

Linked