I would like to resample my data with a weighted bootstrap in order to build a random forest.

The situation is as follows.

My original data consist of normal subjects (N = 20000) and patients (N = 500).

I made a new data set with normal subjects (N = 2000) and patients (N = 500), because I ran an experiment on only these 2500 subjects.

As you can see, the normal subjects are a 1/10 subsample of the original data, while all of the patients were included.

Therefore, I need to give a weight to the normal subjects when fitting a machine learning algorithm.

Please let me know how I can bootstrap with weights in R.

Thank you.

Emmanuel-Lin
SJUNLEE
  • Your question is way too general. What code have you written so far? Could you produce a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? What is your expected result? – Emmanuel-Lin Jul 03 '18 at 12:21
  • @Emmanuel-Lin Sorry about that. I implemented a random forest model based on a case-cohort study design. Because of this study design, I need to bootstrap with weights. However, I have no idea how to do this, so I cannot provide a reproducible example with code. Thank you – SJUNLEE Jul 06 '18 at 07:19

1 Answer

It sounds like you really need stratified resampling rather than weighted resampling.

Your data are structured into two groups of different sizes, and you would like to preserve that structure in your bootstrap. You didn't say what function you are applying to these data, so let's use something simple like the mean.

Generate some fake data, and take the (observed) means:

set.seed(1)  # fix the random seed so the example is reproducible
controls <- rnorm(2000, mean = 10)
patients <- rnorm(500, mean = 9.7)

mean(controls)
mean(patients)

Tell R we want to perform 200 bootstraps, and set up two empty vectors to store means for each bootstrap sample:

nbootraps <- 200
boot_controls <- numeric(nbootraps)
boot_patients <- numeric(nbootraps)

Using a loop we can draw resamples of the same size as you have in the original sample, and calculate the means for each.

for (i in seq_len(nbootraps)) {
  # draw a bootstrap sample of each group, with replacement, at its original size
  new_controls <- sample(controls, replace = TRUE)
  new_patients <- sample(patients, replace = TRUE)

  # store the mean of each bootstrap sample in the boot_ vectors
  boot_controls[i] <- mean(new_controls)
  boot_patients[i] <- mean(new_patients)
}
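
As a rough sketch of a more compact alternative, the same per-group resampling can be written with `replicate()`, and each bootstrap distribution summarised with a percentile interval. The helper `boot_means()` and the variable names below are made up for illustration, and the data are simulated as above:

```r
set.seed(123)
x_controls <- rnorm(2000, mean = 10)
x_patients <- rnorm(500, mean = 9.7)

# Hypothetical helper: mean of n_boot resamples of x, each drawn with
# replacement at the original sample size
boot_means <- function(x, n_boot = 200) {
  replicate(n_boot, mean(sample(x, replace = TRUE)))
}

bm_controls <- boot_means(x_controls)
bm_patients <- boot_means(x_patients)

# 95% percentile intervals for each group mean
ci_controls <- quantile(bm_controls, probs = c(0.025, 0.975))
ci_patients <- quantile(bm_patients, probs = c(0.025, 0.975))
ci_controls
ci_patients
```

Because each group is resampled on its own, the group sizes (2000 and 500) are identical in every bootstrap replicate.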

Finally, plot the bootstrap distributions for group means:

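# compute the histograms without drawing them yet
p1 <- hist(boot_controls, plot = FALSE)
p2 <- hist(boot_patients, plot = FALSE)

# overlay both histograms with semi-transparent colours
plot(p1, col = rgb(0, 0, 1, 1/4), xlim = c(9.5, 10.5), main = "")
plot(p2, col = rgb(1, 0, 0, 1/4), add = TRUE)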
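
For a model such as a random forest you would resample whole rows rather than a single vector. A minimal sketch of stratified resampling on a data frame might look like the following; the data frame `dat`, its columns `group` and `x`, and the helper `stratified_boot()` are all assumptions made up for this example:

```r
set.seed(42)
# Hypothetical data: one row per subject, `group` marks the stratum
dat <- data.frame(
  group = rep(c("control", "patient"), times = c(2000, 500)),
  x     = c(rnorm(2000, mean = 10), rnorm(500, mean = 9.7))
)

# Resample row indices with replacement, separately within each stratum
stratified_boot <- function(data, strata) {
  idx_by_group <- split(seq_len(nrow(data)), data[[strata]])
  idx <- unlist(
    lapply(idx_by_group, function(i) i[sample.int(length(i), replace = TRUE)]),
    use.names = FALSE
  )
  data[idx, , drop = FALSE]
}

boot_dat <- stratified_boot(dat, "group")
table(boot_dat$group)  # same group sizes as the original: 2000 and 500
```

Calling `stratified_boot()` once per tree (or once per bootstrap replicate) would give resampled data sets that always keep the original 2000/500 split.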
tellis
  • However, I would like to build the random forest model based on a case-cohort study. Because of this study design, I need to bootstrap with weights on the subjects. Could you explain how I can do this? Thank you so much. – SJUNLEE Jul 06 '18 at 07:23
  • In that case it is not clear what you mean - what do you want to weight, and what should the weightings be? – tellis Jul 11 '18 at 05:45
  • I have a case-cohort study design, as I mentioned in my question, so I already have a "weight" value for each subject. For example, when I analyse the data for hazard ratios, I have to use a "weighted" Cox hazard regression model. Now I would like to build a random survival forest model. In this process, I would like to bootstrap the original data in a way that reflects the weights, to create the "in-bag" data for building the forest. However, I have no idea how to do this. Thank you. – SJUNLEE Jul 12 '18 at 08:56
  • This doesn't really help me understand what your weights are, but you probably want to specify weights using the `prob` argument in `sample()`. Have a look at `?sample`. – tellis Jul 16 '18 at 10:56
  • Thank you! It works pretty well and this is what I wanted. – SJUNLEE Jul 17 '18 at 13:02
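
For readers who land here, the `prob` suggestion from the comments could look roughly like the sketch below. The asker's actual weights were never posted, so the weights here are an assumption: inverse sampling fractions taken from the question (controls were a 1/10 subsample, so weight 10; all patients were kept, so weight 1). The data frame `dat` and its columns are likewise made up for illustration:

```r
set.seed(1)
# Hypothetical data: one row per subject
dat <- data.frame(
  group = rep(c("control", "patient"), times = c(2000, 500)),
  x     = c(rnorm(2000, mean = 10), rnorm(500, mean = 9.7))
)

# Assumed weights: inverse of each group's sampling fraction
dat$w <- ifelse(dat$group == "control", 10, 1)

# Weighted bootstrap: draw row indices with replacement, with selection
# probability proportional to the weight (prob need not sum to one)
idx <- sample.int(nrow(dat), size = nrow(dat), replace = TRUE, prob = dat$w)
in_bag <- dat[idx, ]

# Share of each group in the resample; controls should now be close to
# their share in the original population, 20000 / 20500
prop.table(table(in_bag$group))
```

Unlike stratified resampling, the group sizes here vary from one resample to the next; only their expected proportions are fixed by the weights.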