
I am trying to sample between a range of values as part of a larger loop in R. As the loop progresses to each row j, I want to sample a number between the value given in the start column and the value given in the end column, placing that value in the sampled column for that row.

The results should look something like this:

ID  start  end  sampled
a   25     67   44
b   36     97   67
c   23     85   77
d   15     67   52
e   21     52   41
f   43     72   66
g   39     55   49
h   27     62   35
i   11     99   17
j   21     89   66
k   28     65   48
l   44     58   48
m   16     77   22
n   25     88   65

I started with mapply, which samples for the whole df, but that ends up trying to fit all 14 sampled values into a single row.

df[j,4] <- mapply(function(x, y) sample(seq(x, y), 1), df$start, df$end)

I thought maybe something using seq might work, but this results in errors saying that from must be of length 1.

df[j,4] <- sample(seq(df$start, df$end),1,replace=TRUE)

The outer looping structure is pretty complicated so I haven't included it here, but the df[j,4] part of the code is necessary because it is part of a larger loop. There are situations where rows have to be resampled based on additional dependencies in the actual dataset. For example, the sampled value of a might need to be larger than b. The rest of the code updates the sampled column, checks for dependencies, and will rerun the sample if the dependencies aren't met. If I can get this sampling section to work, I should be able to plug it in without too much trouble (I hope).

Here's a sample data set.

df <- structure(list(ID = c("a", "b", "c", "d", "e", "f", "g", "h", 
"i", "j", "k", "l", "m", "n"), start = c(25, 36, 23, 15, 21, 
43, 39, 27, 11, 21, 28, 44, 16, 25), end = c(67, 97, 85, 67, 
52, 72, 55, 62, 99, 89, 65, 58, 77, 88), sampled = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -14L), spec = structure(list(
    cols = list(ID = structure(list(), class = c("collector_character", 
    "collector")), start = structure(list(), class = c("collector_double", 
    "collector")), end = structure(list(), class = c("collector_double", 
    "collector")), sampled = structure(list(), class = c("collector_logical", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))
Corey

3 Answers


First, put the data in a format that is easier to use with dput(df):

df <- structure(list(ID = structure(1:14, .Label = c("a", "b", "c", 
    "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n"), class = "factor"), 
    start = c(25L, 36L, 23L, 15L, 21L, 43L, 39L, 27L, 11L, 21L, 
    28L, 44L, 16L, 25L), end = c(67L, 97L, 85L, 67L, 52L, 72L, 
    55L, 62L, 99L, 89L, 65L, 58L, 77L, 88L), sampled = c(44L, 
    67L, 77L, 52L, 41L, 66L, 49L, 35L, 17L, 66L, 48L, 48L, 22L, 
    65L)), class = "data.frame", row.names = c(NA, -14L))

You were very close with mapply() but you made it harder than it needs to be:

df$sampled <- mapply(function(x, y) sample(seq(x, y), 1), df$start, df$end)
df
#    ID start end sampled
# 1   a    25  67      67
# 2   b    36  97      86
# 3   c    23  85      54
# 4   d    15  67      36
# 5   e    21  52      37
# 6   f    43  72      60
# 7   g    39  55      44
# 8   h    27  62      37
# 9   i    11  99      86
# 10  j    21  89      52
# 11  k    28  65      65
# 12  l    44  58      51
# 13  m    16  77      62
# 14  n    25  88      31
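
One caveat with sample(seq(x, y), 1): if start and end are ever equal, seq(x, y) has length one, and sample() then treats that single number n as shorthand for 1:n. The sketch below guards against that by indexing into the sequence instead. The names sample_between and toy are made up for illustration and are not part of base R or the original data.

```r
# Hypothetical helper: draw one integer uniformly from start..end (inclusive).
# Indexing into the sequence sidesteps sample()'s special case for a
# length-one numeric vector, where sample(n, 1) draws from 1:n.
sample_between <- function(start, end) {
  values <- seq(start, end)
  values[sample.int(length(values), 1)]
}

# Toy data (assumed for illustration); the second row has start == end
toy <- data.frame(ID = c("a", "b"), start = c(25, 40), end = c(67, 40))
toy$sampled <- mapply(sample_between, toy$start, toy$end)
```

With the plain sample(seq(40, 40), 1), the second row could come back as any value from 1 to 40; with the helper it is always 40.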
dcarlson
  • `mapply()` works to provide samples for the df, but the `df[j,4]` is necessary because it is part of a larger loop. There are situations where rows have to be resampled. There are additional dependencies in the actual dataset, for instance, the sampled value of `a` might need to be larger than `b`. The rest of the code updates the `sampled` column, checks for dependencies, and will rerun the sample if the dependencies aren't met. – Corey Nov 01 '19 at 04:20

You might not need to loop at all. If what you need is a value between start and end, that is almost equivalent to sampling a number between 0 and 1, multiplying it by the range, and adding start.

library(dplyr)  # provides %>% and mutate()
df %>% mutate(sampled = start + round((end-start)*runif(nrow(.))))
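
A note on the "almost": with round(), the two endpoints each receive only half the probability of the interior values. If every integer from start to end should be equally likely, one option is floor() over a range widened by one. This is a base R sketch on made-up data (toy and width are illustrative names), not a replacement for the pipeline above.

```r
set.seed(123)  # only so the sketch is reproducible

# Toy data assumed for illustration
toy <- data.frame(start = c(25, 36, 23), end = c(67, 97, 85))

# Number of integers in each closed range start..end
width <- toy$end - toy$start + 1

# floor(width * U) with U in (0, 1) yields 0, 1, ..., width - 1,
# each with equal probability, so start + that covers start..end evenly
toy$sampled <- toy$start + floor(width * runif(nrow(toy)))
```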

Regarding the updating and dependencies you mentioned in your comment: that sounds a bit complicated. A quick thought: it might be faster to sample many times and choose a draw that fits your criteria.
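
A rough sketch of that idea for a single row: draw a batch of candidates in one call and keep the first one that passes. The bounds, the batch size of 50, and the lower_limit rule are all assumptions for illustration; the real dependencies in your data will look different.

```r
set.seed(7)  # only so the sketch is reproducible

start_j <- 36      # bounds of the row being filled (assumed values)
end_j <- 97
lower_limit <- 60  # hypothetical dependency: the value must exceed this

# Draw a batch of candidates at once instead of one at a time
candidates <- sample(seq(start_j, end_j), 50, replace = TRUE)

# Keep the first candidate that satisfies the rule, or NA if none does
passing <- candidates[candidates > lower_limit]
chosen <- if (length(passing) > 0) passing[1] else NA
```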

StupidWolf
  • It unfortunately is a bit complicated. The dependency issues are here: https://stackoverflow.com/questions/57000883/if-data-present-replace-with-data-from-another-column-based-on-row-id & here: https://stackoverflow.com/questions/58335382/if-values-in-a-range-of-columns-arent-present-in-another-column-replace-with-n . Originally my thought was to sample the whole set & basically "re-roll" if dependencies don't match. The problem is that the data set has 1000+ rows and even a test set of 100 entries can take a long time if you're "unlucky". So unfortunately, going row by row is necessary :\ – Corey Nov 01 '19 at 11:15

Figured it out:

df[j,4] <- mapply(function(x, y) sample(seq(x, y), 1), df[j,"start"], df[j,"end"])

I just needed to be specific as to which row of the sampled values I wanted to enter into df[j,4]. Specifying row j for columns start and end did the trick.
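
For anyone who lands here, this is roughly how the row-by-row version with a re-roll might look. It is only a sketch: toy is made-up data, meets_rule is a hypothetical stand-in for the real dependency checks, and the cap of 1000 attempts is an arbitrary choice to keep the loop from running forever. Since only one row is sampled at a time, mapply isn't strictly needed here.

```r
set.seed(42)  # only so the sketch is reproducible

# Toy data assumed for illustration
toy <- data.frame(ID = c("a", "b"), start = c(25, 36), end = c(67, 97),
                  sampled = NA_real_)

# Hypothetical rule: row j must be at least as large as row j - 1
meets_rule <- function(d, j) j == 1 || d$sampled[j] >= d$sampled[j - 1]

for (j in seq_len(nrow(toy))) {
  for (attempt in 1:1000) {  # cap the retries so the loop always ends
    range_j <- seq(toy$start[j], toy$end[j])
    # Index into the sequence so a length-one range is handled correctly
    toy$sampled[j] <- range_j[sample.int(length(range_j), 1)]
    if (meets_rule(toy, j)) break
  }
}
```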

Corey