I am trying to extract random samples from 2 columns of my database (hours of work and the relative number of patients visited), and then I would like to calculate the mean progressively. By that I mean: the mean of the first 2 samples, then the mean of that result and the third sample, and so on.

Is it possible? Is there a function for that?

Thank you all for the help.

L.

This is how I am extracting the samples.

library(dplyr)

set.seed(2020)
obs <- rnorm(10, mean = 0, sd = 1)
time <- rnorm(10, mean = 0.5, sd = 1)
rdf <- data.frame(obs, time)
sample_n(rdf, 1)

p <- replicate(100, expr = (sample_n(rdf, 1) + sample_n(rdf, 1))/2)
jay.sf
  • People often use words like progressive and dynamic which can mean almost anything and therefore, ... nothing. This case calls for the use of the term "recursive". – IRTFM Feb 20 '23 at 03:19
  • @IRTFM Any ideas why recursion is **faster** than a loop in [this](https://stackoverflow.com/a/75506559/6574038) case below? – jay.sf Feb 20 '23 at 07:56
  • @jay.sf. Not sure. But I’m also wondering if the use of `Recall` might improve the performance of the recursion approach. – IRTFM Feb 20 '23 at 16:18
  • @IRTFM `Recall` didn't change the performance noticeably. But it is a very interesting feature that keeps the recursive function working if you rename it. Thanks! – jay.sf Feb 21 '23 at 06:44

3 Answers

One option is to use a for loop with a fixed number of samples. For example, to take 5 samples and calculate the means progressively, the loop starts from the first sample, draws the next sample on each iteration, and then takes the mean of the previous mean and that new sample:

library(dplyr)  # provides sample_n()

set.seed(2020)
obs <- rnorm(10, mean = 0, sd = 1)
time <- rnorm(10, mean = 0.5, sd = 1)
rdf <- data.frame(obs, time)

nsamp <- 5  # number of samples 

mean_vect <- numeric(nsamp)  # create a vector to store the means

mean_vect[1] <- mean(sample_n(rdf, 1)$obs)  # mean of first sample as starting point

# start calculations to fifth sample iteratively
for (i in 2:nsamp) {
  # select the next sample
  next_samp <- sample_n(rdf, 1)
  # calculate the mean between the previous mean and the next sample
  mean_vect[i] <- mean(c(mean_vect[i-1], next_samp$obs))
}

# print the means
print(mean_vect)

[1] -1.13040590 -0.20491620  0.04831609  0.08284144  0.40170747
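
As a side note, the recursion used here is not the ordinary cumulative mean of all samples drawn so far: each step halves the weight of everything seen before. The sketch below is only an illustration on stand-in data (the seed, the vector `x`, and the names `pairwise_mean`, `cumulative_avg` and `w` are assumptions, not part of the answer above) and compares the two definitions:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(5)                    # stand-in for five sampled values

# Pairwise recursion from the question: m[k] = (m[k - 1] + x[k]) / 2
pairwise_mean <- Reduce(function(m, xk) (m + xk) / 2, x, accumulate = TRUE)

# Ordinary cumulative mean of the first k samples, for comparison
cumulative_avg <- cumsum(x) / seq_along(x)

# Weights the pairwise recursion puts on x[1], ..., x[n] in the final value:
# the first two samples share the smallest weight, the last sample gets 1/2
n <- length(x)
w <- c(0.5^(n - 1), 0.5^((n - 1):1))

cbind(pairwise_mean, cumulative_avg)
```

The two columns agree for the first two samples and generally diverge afterwards, so it is worth deciding which definition is actually wanted before choosing an implementation.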
S-SHAAF

You could define a recursive function (a function that calls itself).

f <- function(S, R, i=1, cm=NULL, res=NULL, ...) {
  S <- rbind(cm, rdf[sample.int(nrow(rdf), 1), ])
  cm <- colMeans(S)
  res <- rbind(res, cm)
  return(if (i < R) {
    f(S, R=R, i=i + 1, cm=cm, res=res)  ## also `Recall(.)` instead of `f(.)`
  } else {
    `rownames<-`(as.data.frame(res), NULL)
  })
}

set.seed(42)
f(rdf[sample.int(nrow(rdf), 1), ], R=10)
#             obs        time
# 1   0.376972125 -0.35312282
# 2  -1.209781097  0.01180847
# 3  -0.416404486 -0.17065718
# 4   0.671363430 -0.97981606
# 5   0.394365109 -0.21075628
# 6  -0.368020398 -0.04117009
# 7  -0.033236012  0.68404454
# 8   0.042065388  0.62117402
# 9   0.209518756  0.13402560
# 10 -0.009929495 -1.20236950

For large R, you will probably have to increase your C stack size.

But you could also use a for loop.

R <- 10
res1 <- matrix(nrow=0, ncol=2)

set.seed(42)
for (i in seq_len(R - 1)) {
  if (nrow(res1) == 0) {
    res1 <- rdf[sample.int(nrow(rdf), 1), ]
  }
  S <- rdf[sample.int(nrow(rdf), 1), ]
  res1 <- rbind(res1, colMeans(rbind(res1[nrow(res1), ], S)))
}
res1
#             obs        time
# 1   0.376972125 -0.35312282
# 2  -1.209781097  0.01180847
# 3  -0.416404486 -0.17065718
# 4   0.671363430 -0.97981606
# 5   0.394365109 -0.21075628
# 6  -0.368020398 -0.04117009
# 7  -0.033236012  0.68404454
# 8   0.042065388  0.62117402
# 9   0.209518756  0.13402560
# 10 -0.009929495 -1.20236950

Here is a quick benchmark of both versions (R = 2000); recursion appears to be almost twice as fast.

# Unit: milliseconds
#      expr      min       lq     mean   median        uq       max neval cld
# recursive 577.0595 582.0189 587.3052 586.9783  592.4281  597.8778     3  a 
#  for-loop 991.4360 993.7170 997.2436 995.9980 1000.1473 1004.2966     3   b

Data:

rdf <- structure(list(obs = c(0.376972124936433, 0.301548373935665, 
-1.0980231706536, -1.13040590360378, -2.79653431987176, 0.720573498411587, 
0.93912102300901, -0.229377746707471, 1.75913134696347, 0.117366786802848
), time = c(-0.353122822287008, 1.40925918161821, 1.69637295955276, 
0.128416096258652, 0.376739766712564, 2.30004311672545, 2.20399587729432, 
-2.53876460529759, -1.78897494991878, 0.558303494992923)), class = "data.frame", row.names = c(NA, 
-10L))
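
Both versions above grow their result with `rbind` on every step. A possible variation, sketched here under the assumption that all row indices may be drawn up front, preallocates a numeric matrix and fills it in place. The function name `progressive_means` is hypothetical, the example data are regenerated as in the question, and this variant was not part of the benchmark above, so no timing claim is made for it:

```r
# Example data, generated as in the question
set.seed(2020)
rdf <- data.frame(obs  = rnorm(10, mean = 0,   sd = 1),
                  time = rnorm(10, mean = 0.5, sd = 1))

# Hypothetical helper: progressive column means over R sampled rows
progressive_means <- function(df, R) {
  m   <- as.matrix(df)                             # numeric matrix of the data
  idx <- sample.int(nrow(m), R, replace = TRUE)    # draw all row indices at once
  res <- matrix(NA_real_, nrow = R, ncol = ncol(m),
                dimnames = list(NULL, colnames(m)))
  res[1, ] <- m[idx[1], ]                          # first sample is the start value
  for (i in seq_len(R)[-1]) {
    # mean of the previous mean and the next sampled row, column by column
    res[i, ] <- (res[i - 1, ] + m[idx[i], ]) / 2
  }
  as.data.frame(res)
}

set.seed(42)
out <- progressive_means(rdf, R = 10)
out
```

Because the indices are drawn in a single call, the numbers are not guaranteed to match the outputs shown above even with the same seed.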
jay.sf

another approach (with your example data rdf):

  • create a function mean_of_random_pair(xs) which draws two random items of a set xs and calculates their mean:
mean_of_random_pair <- function(xs){
  xs |> sample(size = 2) |> mean(na.rm = TRUE)
}
  • create a function cumulative_mean which calculates the grand mean X as the mean of the existing X and a new item x:
cumulative_mean <- function(xs){
  xs |> Reduce(f = \(X, x) mean(c(X, x)),
               accumulate = TRUE
               )
}
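
To see what `Reduce(..., accumulate = TRUE)` does in `cumulative_mean`, here is a small worked example on a fixed input chosen only for illustration (the input `c(2, 4, 8)` and the variable `worked` are not from the answer; the `\(...)` lambda syntax needs R 4.1 or later):

```r
# Same definition as above, written without the pipe
cumulative_mean <- function(xs) {
  Reduce(f = \(X, x) mean(c(X, x)), x = xs, accumulate = TRUE)
}

# Step 1: 2 (first element is kept as is)
# Step 2: mean(c(2, 4)) = 3
# Step 3: mean(c(3, 8)) = 5.5
worked <- cumulative_mean(c(2, 4, 8))
worked
# [1] 2.0 3.0 5.5
```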

link the functions above into a pipeline and run it `runs` times on `rdf$obs`:

runs = 100

1:runs |>
  Map(f = \(i) mean_of_random_pair(rdf$obs)) |>
  cumulative_mean()

output (the sequence of iterative averaging):

[1]  1.1000858  0.8557774  0.3041130  0.4262881 -0.4658256
# ...

inspect output (for n = 5000 simulation runs):

library(purrr)  # provides map()

runs = 5e3
set.seed(4711)
densities <- 
  list(obs = 'obs', time = 'time') |>
  map(\(var){
    1:runs |>
      Map(f = \(i) mean_of_random_pair(rdf[[var]])) |>
      cumulative_mean() |>
      density()
  })

densities$time |> plot(col = 'blue', ylim = c(0, 1), xlim = c(-3, 3), main = 'foo')
densities$obs |> lines(col = 'red')

I_O