Calculate mean of random extract samples

Question

I am trying to extract random samples from 2 columns of my database (hours of work and the corresponding number of patients visited), and then I would like to calculate the mean progressively. By that I mean the mean of the first 2 samples, then the mean of the value I just calculated and the third sample, and so on.

Is it possible? Is there a function for that?

Thank you all for the help.

L.

This is how I am extracting the samples.

library(dplyr)

set.seed(2020)
obs <- rnorm(10, mean = 0, sd = 1)
time <- rnorm(10, mean = 0.5, sd = 1)
rdf <- data.frame(obs, time)
sample_n(rdf, 1)

p <- replicate(100, expr = (sample_n(rdf, 1) + sample_n(rdf, 1))/2)

People often use words like progressive and dynamic which can mean almost anything and therefore, ... nothing. This case calls for the use of the term "recursive". — IRTFM, Feb 20 '23 at 03:19
@IRTFM Any ideas why recursion is **faster** than a loop in [this](https://stackoverflow.com/a/75506559/6574038) case below? — jay.sf, Feb 20 '23 at 07:56
@jay.sf. Not sure. But I’m also wondering if the use of `Recall` might improve the performance of the recursion approach. — IRTFM, Feb 20 '23 at 16:18
@IRTFM `Recall` didn't change the performance noticeably. But it is a very interesting feature that keeps the recursive function working if you rename it. Thanks! — jay.sf, Feb 21 '23 at 06:44

S-SHAAF · Answer 1 · 2023-02-19T23:35:59.567

One option is to use a for loop and decide in advance how many samples you would like. For example, to take 5 samples and calculate the means progressively, the loop starts with the first sample, then repeatedly draws the next sample and calculates the mean of the previous mean and that new sample:

library(dplyr)  # provides sample_n()

set.seed(2020)
obs <- rnorm(10, mean = 0, sd = 1)
time <- rnorm(10, mean = 0.5, sd = 1)
rdf <- data.frame(obs, time)

nsamp <- 5  # number of samples 

mean_vect <- numeric(nsamp)  # create a vector to store the means

mean_vect[1] <- mean(sample_n(rdf, 1)$obs)  # mean of first sample as starting point

# iterate from the second sample up to the last (fifth) sample
for (i in 2:nsamp) {
  # select the next sample
  next_samp <- sample_n(rdf, 1)
  # calculate the mean between the previous mean and the next sample
  mean_vect[i] <- mean(c(mean_vect[i-1], next_samp$obs))
}

# print the means
print(mean_vect)

[1] -1.13040590 -0.20491620  0.04831609  0.08284144  0.40170747
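
Note that this recursion is not the ordinary running mean: at every step the newest sample gets weight 1/2 and everything before it is halved. The following minimal sketch contrasts the two on made-up values (`x`, `half_mean` and `running_mean` are illustrative names, not part of the question's data):

```r
# Illustrative values only (not taken from the question's data)
x <- c(2, 4, 8, 10)

# Recursion used above: average the previous result with the next value
half_mean <- numeric(length(x))
half_mean[1] <- x[1]
for (i in seq_along(x)[-1]) {
  half_mean[i] <- mean(c(half_mean[i - 1], x[i]))
}

# Ordinary running mean: mean of all values seen so far
running_mean <- cumsum(x) / seq_along(x)

half_mean     # 2.00 3.00 5.50 7.75
running_mean  # 2.000000 3.000000 4.666667 6.000000
```

If the ordinary running mean is what is actually wanted, the `cumsum()` line is enough and no loop is needed.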

jay.sf · Answer 2 · 2023-02-21T06:45:59.053

You could define a recursive function (a function that calls itself).

f <- function(S, R, i=1, cm=NULL, res=NULL, ...) {
  S <- rbind(cm, rdf[sample.int(nrow(rdf), 1), ])
  cm <- colMeans(S)
  res <- rbind(res, cm)
  return(if (i < R) {
    f(S, R=R, i=i + 1, cm=cm, res=res)  ## also `Recall(.)` instead of `f(.)`
  } else {
    `rownames<-`(as.data.frame(res), NULL)
  })
}

set.seed(42)
f(rdf[sample.int(nrow(rdf), 1), ], R=10)
#             obs        time
# 1   0.376972125 -0.35312282
# 2  -1.209781097  0.01180847
# 3  -0.416404486 -0.17065718
# 4   0.671363430 -0.97981606
# 5   0.394365109 -0.21075628
# 6  -0.368020398 -0.04117009
# 7  -0.033236012  0.68404454
# 8   0.042065388  0.62117402
# 9   0.209518756  0.13402560
# 10 -0.009929495 -1.20236950

For large `R` you will probably have to increase your C stack size.
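
As a rough sketch of how the relevant limits can be inspected from within R: `Cstack_info()` and the `expressions` option are base R, while the value `1e4` below is only an assumed example. The C stack size itself is normally set by the operating system (for instance with `ulimit -s` on Unix-alikes before starting R), so this only shows where to look.

```r
# Limit on nested expression evaluation (R's documented default is 5000)
getOption("expressions")

# C stack size, current usage, growth direction and evaluation depth
Cstack_info()

# Raise the expression limit for this session (assumed example value;
# R accepts values up to 500000), then restore the previous setting
old <- options(expressions = 1e4)
getOption("expressions")
options(old)
```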

But you could also use a for loop.

R <- 10
res1 <- matrix(nrow=0, ncol=2)

set.seed(42)
for (i in seq_len(R - 1)) {
  if (nrow(res1) == 0) {
    res1 <- rdf[sample.int(nrow(rdf), 1), ]
  }
  S <- rdf[sample.int(nrow(rdf), 1), ]
  res1 <- rbind(res1, colMeans(rbind(res1[nrow(res1), ], S)))
}
res1
#             obs        time
# 1   0.376972125 -0.35312282
# 2  -1.209781097  0.01180847
# 3  -0.416404486 -0.17065718
# 4   0.671363430 -0.97981606
# 5   0.394365109 -0.21075628
# 6  -0.368020398 -0.04117009
# 7  -0.033236012  0.68404454
# 8   0.042065388  0.62117402
# 9   0.209518756  0.13402560
# 10 -0.009929495 -1.20236950

Here is a quick benchmark of both versions (R = 2000, 3 evaluations each); in this run, recursion appears to be almost twice as fast.

# Unit: milliseconds
#      expr      min       lq     mean   median        uq       max neval cld
# recursive 577.0595 582.0189 587.3052 586.9783  592.4281  597.8778     3  a 
#  for-loop 991.4360 993.7170 997.2436 995.9980 1000.1473 1004.2966     3   b
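
One possible contributor to the for-loop timing (an assumption, not something measured here) is that `res1` is grown with `rbind()` on a data frame in every iteration. A minimal sketch of the same averaging rule with preallocated storage might look like this; `progressive_mean_prealloc` is a hypothetical helper name, and the data are regenerated as in the question so that the block is self-contained:

```r
set.seed(2020)
rdf <- data.frame(obs = rnorm(10), time = rnorm(10, mean = 0.5))

# Hypothetical helper: same averaging rule, but the result matrix is
# allocated once and all row indices are drawn up front
progressive_mean_prealloc <- function(df, R) {
  X <- as.matrix(df)
  idx <- sample.int(nrow(X), R, replace = TRUE)
  res <- matrix(NA_real_, nrow = R, ncol = ncol(X),
                dimnames = list(NULL, colnames(X)))
  res[1, ] <- X[idx[1], ]
  for (i in seq_len(R)[-1]) {
    # mean of the previous result and the newly drawn row
    res[i, ] <- (res[i - 1, ] + X[idx[i], ]) / 2
  }
  as.data.frame(res)
}

out <- progressive_mean_prealloc(rdf, R = 10)
head(out, 3)
```

Because the indices are drawn in one call, the random draws need not match those of the versions above even with the same seed.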

Data:

rdf <- structure(list(obs = c(0.376972124936433, 0.301548373935665, 
-1.0980231706536, -1.13040590360378, -2.79653431987176, 0.720573498411587, 
0.93912102300901, -0.229377746707471, 1.75913134696347, 0.117366786802848
), time = c(-0.353122822287008, 1.40925918161821, 1.69637295955276, 
0.128416096258652, 0.376739766712564, 2.30004311672545, 2.20399587729432, 
-2.53876460529759, -1.78897494991878, 0.558303494992923)), class = "data.frame", row.names = c(NA, 
-10L))

I_O · Answer 3 · 2023-02-20T11:27:56.140

Another approach (with your example data rdf):

Create a function mean_of_random_pair(xs) that draws two random items from a set xs and calculates their mean:

mean_of_random_pair <- function(xs){
  xs |> sample(size = 2) |> mean(na.rm = TRUE)
}

Create a function cumulative_mean that, at each step, replaces the running value X with the mean of X and a new item x:

cumulative_mean <- function(xs){
  xs |> Reduce(f = \(X, x) mean(c(X, x)),
               accumulate = TRUE
               )
}
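
To see what `Reduce()` with `accumulate = TRUE` does here, a small worked example on made-up numbers may help (`half_step` and the values in `xs` are illustrative, not part of the answer):

```r
# Illustrative two-argument step: mean of the running value and the new item
half_step <- function(X, x) mean(c(X, x))

xs <- c(1, 3, 7)

# With accumulate = TRUE every intermediate result is kept
Reduce(half_step, xs, accumulate = TRUE)  # 1.0 2.0 4.5

# Without it only the final value is returned
Reduce(half_step, xs)                     # 4.5
```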

Link the above functions into a pipeline and run it `runs` times on the set rdf$obs:

runs = 100

1:runs |>
  Map(f = \(i) mean_of_random_pair(rdf$obs)) |>
  cumulative_mean()

Output (the sequence of iterative averages):

[1]  1.1000858  0.8557774  0.3041130  0.4262881 -0.4658256
# ...

Inspect the output (for n = 5000 simulation runs):

library(purrr)  # provides map()

runs = 5e3
set.seed(4711)
densities <- 
  list(obs = 'obs', time = 'time') |>
  map(\(var){
    1:runs |>
      Map(f = \(i) mean_of_random_pair(rdf[[var]])) |>
      cumulative_mean() |>
      density()
  })

densities$time |> plot(col = 'blue', ylim = c(0, 1), xlim = c(-3, 3), main = 'foo')
densities$obs |> lines(col = 'red')
