R: randomly sampling a dataframe based on another dataframe with a range

Question

So I have 2 dataframes, one (dfA) with climate data that has over 100,000 rows, and one (dfB) as a list of sampling criteria (seed zone, altitude, and freq) with only about 50 rows.

What I'm aiming to do is use dfB as a basis to randomly sample rows from dfA based on shared columns (seed zone and altitude), picking n rows based on the freq column. This has been successful! The first code below is what I have managed to make work as far as I can tell, but it gives very limited results as the values have to match exactly and very few do.

Ideally, what I'd like to do, is have the sampling happen as above, but across a range of altitudes based on the dfB$Altitude value (say +-10)

Code for basic sampling (works):

sample <- dfA %>%
  inner_join(dfB) %>% 
  group_by(Seed_zone, Altitude) %>%
  sample_n(first(freq))

Code I've tried (in multiple variations) that hasn't worked

sample <- dfA %>%
  filter(dfA > dfB -10 & dfA < dfB +10) %>%
  left_join(dfB, by = c("Seed_zone")) %>%
  group_by(Seed_zone) %>%
  sample_n(first(freq))

dfA example data

Climate data	Altitude	Seed_zone
5.107707	305	101
4.845323	499	203
2.985152	7	101
22.17268	299	101

dfB example data

Altitude	Seed_zone	freq
300	101	2
502	203	1

Expected outcome (using +- 10)

Climate data	Altitude	Seed_zone	freq
5.107707	305	101	2
4.845323	499	203	1
22.17268	299	101	2

score 0 · Accepted Answer · answered Jun 27 '23 at 21:46

We can use join_by. Note that when joining on an inequality, we cannot calculate on the fly, so we pre-calc the +/-0 10.

library(dplyr)
dfB %>%
  mutate(m10 = Altitude-10, p10 = Altitude+10) %>%
  inner_join(dfA, join_by(Seed_zone, m10 <= Altitude, p10 >= Altitude), suffix = c(".B", "")) %>%
  group_by(Seed_zone, Altitude.B) %>%
  sample_n(first(freq)) %>%
  ungroup()
# # A tibble: 3 × 7
#   Altitude.B Seed_zone  freq   m10   p10 `Climate data` Altitude
#        <int>     <int> <int> <dbl> <dbl>          <dbl>    <int>
# 1        300       101     2   290   310           5.11      305
# 2        300       101     2   290   310          22.2       299
# 3        502       203     1   492   512           4.85      499

In this case, Altitude.B is from dfB (ergo why I intentionally change the suffix) and Altitude is the original from dfA (I thought it easier to keep this unchanged .. perhaps I should have used suffix = c(".B", ".A")).

The issue I'm getting now is: Error in 'sample_n()': Can't compute indices In group 1: 'Seed_zone = 101', 'Altitude.B = 300'. Caused by error: 'size' must be less than or equal to 1 (size of data) — user22141110, Jun 27 '23 at 22:39
Also if I change sample_n() to slice_sample(n = freq) I get: Error in 'slice_sample()': 'n' must be a round number, not a object. — user22141110, Jun 27 '23 at 22:44
Good news! It's a slightly messy output but I can get it to work by changing it to: sample_n(size = first(freq), replace = TRUE) — user22141110, Jun 27 '23 at 22:52

R: randomly sampling a dataframe based on another dataframe with a range

1 Answers1

Linked