I have the following problem. I have a number of plots that cover a biological gradient, and I would like to select the 25 plots that cover the gradient best. To do this, I took the minimum and maximum values, calculated evenly spaced target values between them, and chose the plot closest to each target value. This works fine. However, sometimes one plot is the closest match to two target values, so I end up with duplicates in my list, which I would like to avoid. I could obviously increase `length.out`, but I don't consider that an optimal solution. I would like to end up with 25 selected, unique plots.

The following code illustrates the problem: `length.out` is set to 25, but only 19 unique plots are selected.

data <- structure(list(Plot = c("3", "4", "5", "6", "8", "12", "14", 
"15", "17", "18", "19", "20", "21", "22", "23", "25", "26", "28", 
"29", "30", "32", "33", "34", "35", "36", "37", "38", "39", "40", 
"41", "42", "43", "44", "45", "46", "47", "48", "49"), Value = c(2.19490722347427, 
0.817884294633935, 0.834577676660982, 1.19923035999043, 0.293146158435238, 
1.93237941781986, 1.74536845664897, 2.22904916731729, 0.789604037117133, 
0.439716474953651, 0.834321473446987, 1.07386786707173, 0.977203815084214, 
0.539717907433468, 0.950019385036826, 1.10794069639141, 1.41499437622422, 
1.12933520841724, 1.99342508363262, 1.05715847816517, 2.27711128641038, 
1.9766526350752, 2.16657914911448, 2.01955890337827, 1.1080527140292, 
1.16614766657035, 1.04478527637105, 0.980792736677819, 0.818000882117776, 
0.656157422806534, 1.07223822052094, 0.799912719334531, 0.4365715090508, 
0.824331627537106, 1.19478221856558, 1.06047128780385, 1.54822823084764, 
0.582397279167692)), class = "data.frame", row.names = c("3", 
"4", "5", "6", "8", "12", "14", "15", "17", "18", "19", "20", 
"21", "22", "23", "25", "26", "28", "29", "30", "32", "33", "34", 
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", 
"46", "47", "48", "49"))

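opt_seq <- seq(min(data$Value), max(data$Value), length.out = 25)  # 25 evenly spaced target values
sel_plots <- sapply(opt_seq, function(i) which.min(abs(data$Value - i)))  # row index of the closest plot per target
length(unique(sel_plots))  # fewer than 25, because some rows are picked more than once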
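
To see which target values collide, the selected indices can be checked with `duplicated`. The following is only a minimal sketch on made-up numbers; `values`, `targets`, `nearest`, `dup_idx` and `dup_targets` are illustrative names and not part of the data above.

```r
# Toy data (hypothetical, not the plots from the question)
values  <- c(0, 1, 2.2, 3, 10)
targets <- seq(min(values), max(values), length.out = 5)  # 0, 2.5, 5, 7.5, 10

# Nearest value for every target, each chosen independently of the others
nearest <- sapply(targets, function(t) which.min(abs(values - t)))

# Indices that were picked more than once, and the targets that share them
dup_idx     <- unique(nearest[duplicated(nearest)])
dup_targets <- targets[nearest %in% dup_idx]
```

Here the last two targets both land on the largest value, which is the same situation as in the real data.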

Any help is greatly appreciated!

ABiologist
  • What do you want to do when one plot is the closest to two values? For example, if your values were 0, 4, 16, 20, and your plots were 0, 2, 6, 7, 8, 9, 10, 14, 18, 20. And what if the plot with value 20 is missing? You need to decide how to handle situations where there are no "clear" matches. – MrGumble Feb 10 '21 at 08:45
  • If I understand you correctly, you could use a `data.table` rolling join, as described e.g. here: [Find closest value in a vector with binary search](https://stackoverflow.com/a/20133547). – Henrik Feb 10 '21 at 10:10

2 Answers

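You can loop over the target values and, for each one, pick the closest plot among those not yet selected:

# Track which rows have already been selected
sel_plots <- logical(nrow(data))
for(i in opt_seq) {
  # Among the rows not yet selected, mark the one closest to target i
  sel_plots[which(!sel_plots)[which.min(abs(data$Value[!sel_plots] - i))]] <- TRUE
}
sel_plots <- which(sel_plots)  # row indices of the selected plots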
length(unique(sel_plots))
#[1] 25
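
If the order of the targets matters, the same greedy idea could be wrapped in a small helper that returns the chosen row for each target in turn. This is only a sketch: `pick_unique_nearest`, `values` and `targets` are hypothetical names, and the numbers are made up.

```r
# Hypothetical helper: greedy nearest-value selection without replacement.
# Returns one row index per target, in the order of the targets.
pick_unique_nearest <- function(values, targets) {
  stopifnot(length(targets) <= length(values))
  available <- rep(TRUE, length(values))
  picked <- integer(length(targets))
  for (k in seq_along(targets)) {
    candidates <- which(available)
    # Closest value among the rows that are still available
    best <- candidates[which.min(abs(values[candidates] - targets[k]))]
    picked[k] <- best
    available[best] <- FALSE
  }
  picked
}

# Toy data (made up for illustration)
values  <- c(0, 1, 2.2, 3, 10)
targets <- seq(0, 10, length.out = 5)
idx <- pick_unique_nearest(values, targets)
```

Because the selection is greedy, the result depends on the order in which the targets are processed: a target handled late may have to take a value that is far away, since its nearest neighbour was already used.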
GKi

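One way to do this is to loop over the target values, use which.min to find the remaining value with the smallest absolute difference, and delete that element after each iteration.

y <- data$Value  ## copy values column
r <- c()  ## initialize result vector

for (x in opt_seq) {
  i <- which.min(abs(x - y))  ## index of the closest remaining value
  r <- c(r, y[i])  ## store the selected value (not the plot ID)
  y <- y[-i]  ## remove it so it cannot be picked again
}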
r
# [1] 0.2931462 0.4365715 0.4397165 0.5397179 ...
stopifnot(!any(duplicated(r)) & length(r) == 25)
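
Since `r` holds values rather than plot IDs, a variant that removes rows from a copy of the data frame keeps the plot labels attached. This is a rough sketch on a made-up data frame; `toy`, `targets`, `remaining` and `chosen` are illustrative names, and only the column names `Plot` and `Value` are taken from the question.

```r
# Toy data frame with the same column names as in the question (values are made up)
toy <- data.frame(Plot = c("a", "b", "c", "d", "e"),
                  Value = c(0, 1, 2.2, 3, 10),
                  stringsAsFactors = FALSE)
targets <- seq(min(toy$Value), max(toy$Value), length.out = 4)

remaining <- toy           # rows that can still be selected
chosen <- character(0)     # plot IDs, in the order of the targets
for (x in targets) {
  i <- which.min(abs(remaining$Value - x))  # closest remaining row
  chosen <- c(chosen, remaining$Plot[i])    # keep its plot ID
  remaining <- remaining[-i, ]              # drop the row so it cannot be reused
}
```

With the real data, `toy` would be replaced by `data` and `targets` by `opt_seq`.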
jay.sf