Calculating stats for random subsample using R loop

Question

I am trying to find a way in R to randomly subset some data (proportion of suitable habitat in an area, for an ecological study), calculate the mean and the proportion of samples with values > 0, and then save or append those values to a data frame. I then want to repeat this a number of times (1000 in this example). Standard bootstrapping or resampling packages won't work, as I need to calculate the frequency of occurrence as well as the mean of the subsample. I'm aware of the "apply" functions, but those loop over the entire data frame, whereas I'm trying to work on repeated subsamples. I know I need some code to save and output the values calculated in the loop, but I am having issues. "habprop" is the column in a data frame ("data") for which I want to calculate and save the mean and the proportion of positive values.

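for (i in 1:1000) {
  # draw 50 rows at random, without replacement
  randsample=data[sample(1:nrow(data), 50, replace=FALSE),]
  m=mean(randsample$habprop)
  randsamplepos=subset(randsample, habprop > 0)
  habfreq=(nrow(randsamplepos)/nrow(randsample))
  # m and habfreq are overwritten on each pass; nothing is saved yet
}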

Why won't the standard bootstrap work? You can pass an arbitrary function to `boot` and return a list of values. – Rorschach, Jun 15 '15 at 18:08
Create an empty list `lst <- list()` outside of the loop, then at the bottom of the loop `lst[i] <- habfreq`. If `habfreq` is one value per iteration, you can simplify to `v1 <- c()` and `v1[i] <- habfreq`. – Pierre L, Jun 15 '15 at 18:12
Also, why do you calculate `m` in the for loop 1000 times over if you aren't going to use it? – Pierre L, Jun 15 '15 at 18:14
It's best to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Describe what you mean by "having issues": what exactly doesn't work? – MrFlick, Jun 15 '15 at 18:18
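
The loop-based approach suggested in the comments can be sketched as follows. This is only a minimal sketch: the `data` frame is a made-up stand-in (zero-inflated proportions invented for illustration), and the names `n_iter` and `results` are hypothetical. The result columns are preallocated before the loop and filled one row per iteration.

```r
set.seed(42)  # only so the illustration is repeatable

# Hypothetical stand-in for the asker's data: proportions in [0, 1], many zeros
data <- data.frame(habprop = ifelse(runif(5000) < 0.4, 0, runif(5000)))

n_iter <- 1000
results <- data.frame(mean = numeric(n_iter), habfreq = numeric(n_iter))

for (i in seq_len(n_iter)) {
  # Draw 50 rows without replacement
  randsample <- data[sample(nrow(data), 50, replace = FALSE), , drop = FALSE]
  results$mean[i] <- mean(randsample$habprop)
  # Proportion of sampled rows with a positive value
  results$habfreq[i] <- mean(randsample$habprop > 0)
}

head(results)
```

Preallocating `results` avoids growing an object inside the loop, which tends to matter more as the number of iterations increases.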

score 2 · Answer 1 · edited May 23 '17 at 11:51

How about the replicate function? This post looks pretty similar.

Generating some data to work on

data <- data.frame(x1=rpois(5000, 5), x2=runif(5000), x3=rnorm(5000))

Defining a function to sample and take means and counts

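sample_stats <- function(df, n=100){
  # draw n rows at random, without replacement
  df <- df[sample(seq_len(nrow(df)), n, replace=FALSE),]
  mx1 <- mean(df$x1[df$x1>0])  # mean of the positive x1 values only
  x1pos <- sum(df$x1>0)        # count of positive x1 values
  return(c(mx1, x1pos))
}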

Run it once just to see the output

sample_stats(data)

Run it 1000 times

results <- replicate(1000, sample_stats(data, n=100))
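
When the function returns a vector of length 2, `replicate` gives back a matrix with one column per run (2 rows by 1000 columns), so transposing it yields one row per subsample. The sketch below assumes a hypothetical variant, `subsample_stats`, that returns the mean of all sampled values and the proportion of positive values, as the question describes; the `habprop` data are made up for illustration.

```r
set.seed(1)  # only so the illustration is repeatable

# Hypothetical stand-in data: proportions in [0, 1] with many zeros
data <- data.frame(habprop = ifelse(runif(5000) < 0.4, 0, runif(5000)))

subsample_stats <- function(df, n = 50) {
  # Draw n values without replacement
  x <- df$habprop[sample(nrow(df), n, replace = FALSE)]
  c(mean = mean(x), habfreq = mean(x > 0))
}

results <- replicate(1000, subsample_stats(data, n = 50))
dim(results)  # 2 rows (statistics) by 1000 columns (runs)

# One row per subsample, one column per statistic
results_df <- as.data.frame(t(results))
summary(results_df)
```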

answered Jun 15 '15 at 18:19

ajb

Thanks @ajb, this code seems to work. Do you know whether, when the df is set up in the function, `sample` pulls a new randomized sample every time? – Kevin T Jun 16 '15 at 17:56
@kevinT Yes, `sample` will pull a random sample every time. If you want to make the results repeatable (ensure that you can generate the same random sample), you can call `set.seed` directly before you run `sample`. – ajb Jun 16 '15 at 18:44
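
A small illustration of the reproducibility point: calling `set.seed` with the same value before each call to `sample` reproduces the same draw. The seed value 123 and the variable names are arbitrary choices for this sketch.

```r
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)
second <- sample(1:100, 5)

identical(first, second)  # TRUE: same seed, same draw

# Without reseeding, the generator carries on from its current state
third <- sample(1:100, 5)
```

For a repeated run such as the `replicate` call, setting the seed once before the whole run is usually enough to make the full set of results repeatable.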

score 0 · Answer 2 · answered Jun 15 '15 at 18:30

This should be possible using `boot`:

dat <- data.frame(habprop=rnorm(100))

## Function to return statistics from subsamples
stat <- function(dat, inds)
    with(dat, c(mu=mean(habprop[inds]), freq=sum(habprop[inds] > 0)/length(inds)))

library(boot)
boot(data=dat, statistic=stat, R=1000)

# Bootstrap Statistics :
#        original      bias    std. error
# t1* -0.06154533 -0.00324393  0.08377116
# t2*  0.52000000 -0.00073000  0.04853991

answered Jun 15 '15 at 18:30

Rorschach

As I understand it, the problem with bootstrapping in this case is that you are drawing from a distribution and calculating from that (in this example, a normal distribution). As set up, the first statistic is negative because of this, whereas all of the values are positive. – Kevin T Jun 16 '15 at 17:53
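
One way to check this concern: `boot` resamples the rows of whatever data frame it is given, so the negative mean in the output above comes from the `rnorm` example data rather than from the bootstrap itself. The sketch below assumes made-up, zero-inflated proportions as a stand-in for `habprop`. Note that `boot` draws resamples of the full data size with replacement, which is not the same as the 50-row subsample without replacement described in the question.

```r
library(boot)

set.seed(2015)  # only so the illustration is repeatable

# Hypothetical stand-in data: proportions in [0, 1] with many zeros
dat <- data.frame(habprop = ifelse(runif(100) < 0.4, 0, runif(100)))

# Statistic function: boot passes the data and a vector of resampled row indices
stat <- function(d, inds) {
  x <- d$habprop[inds]
  c(mu = mean(x), freq = mean(x > 0))
}

b <- boot(data = dat, statistic = stat, R = 1000)

b$t0             # statistics computed on the original data
dim(b$t)         # 1000 bootstrap replicates by 2 statistics
range(b$t[, 1])  # bootstrap means cannot be negative for non-negative data
```

The per-replicate values in `b$t` can be saved or summarized like any other matrix, for example with `colMeans(b$t)`.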
