I am currently attempting to bootstrap a dataset with 114 observations and 16 variables.

I have used the `sample` function as follows:

`x[sample(nrow(x),size=114,replace=TRUE),]`, where `x` is my dataset.

However, I would like to sample with probabilities tied to a specific column, since `sample` accepts a `prob` argument. For example, for the 5th column I would like values between 1 and 5 to be drawn with probability 0.1 and values between 5 and 200 with probability 0.9.

How would I go about this?

Neil Lunn
  • When asking for help you should include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and the desired output. How exactly are you "storing" these requirements? – MrFlick Feb 24 '17 at 18:42
  • @MrFlick Thanks for your help. I am currently sorting these columns by day, as it is Battle of Britain data. The columns are the individual days, the planes that went down, and the locations that were attacked. I am effectively looking to resample the data to create a counterfactual history: for example, how would British losses have been affected if Germany had only attacked London? – E.Richards Feb 28 '17 at 13:11

1 Answer

If I'm understanding what you're looking for, this might be a job for mapply.

# fake data
x <- as.data.frame(matrix(1:10,ncol=2))
x
  V1 V2
1  1  6
2  2  7
3  3  8
4  4  9
5  5 10

# fake probabilities of each row, for each column
probs <- as.data.frame(matrix(c(.1,.1,.1,.2,.5,.5,.2,.1,.1,.1),ncol=2))
probs
   V1  V2
1 0.1 0.5
2 0.1 0.2
3 0.1 0.1
4 0.2 0.1
5 0.5 0.1

# then a little mapply magic - change size as needed
mapply(sample, x=x, prob=probs, replace=TRUE, size=10)
      V1 V2
 [1,]  5  7
 [2,]  5  7
 [3,]  2  6
 [4,]  5  6
 [5,]  1  6
 [6,]  5  9
 [7,]  1  9
 [8,]  5  9
 [9,]  5  6
[10,]  4  7
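
As a rough check on what the call above does, the sketch below repeats it with a much larger `size` and compares the observed frequencies with `probs`. The seed, the sample size of 10000, and the names `draws`, `freq_V1`, `freq_V2` and `matched` are only illustrative choices. It also measures how often a sampled pair still corresponds to an original row, which shows that each column is drawn independently and the rows are not kept together.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# Same fake data and per-column probabilities as above
x <- as.data.frame(matrix(1:10, ncol = 2))
probs <- as.data.frame(matrix(c(.1, .1, .1, .2, .5,
                                .5, .2, .1, .1, .1), ncol = 2))

# Draw many values per column so the observed frequencies settle down
draws <- mapply(sample, x = x, prob = probs,
                MoreArgs = list(size = 10000, replace = TRUE))

# Relative frequency of each value, column by column;
# these should be close to the matching column of probs
freq_V1 <- prop.table(table(factor(draws[, "V1"], levels = x$V1)))
freq_V2 <- prop.table(table(factor(draws[, "V2"], levels = x$V2)))
round(freq_V1, 2)
round(freq_V2, 2)

# In the original data every row satisfies V2 == V1 + 5.
# With independent draws this only holds by chance, with
# expected share sum(probs$V1 * probs$V2) = 0.15.
matched <- mean(draws[, "V2"] - draws[, "V1"] == 5)
matched
```

If the rows have to stay intact, the weights need to be attached to rows rather than to individual columns.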
Matt Tyers
  • Thanks for your response, Matt. Is it possible to perform this operation on one column of the data and keep the rows (instances) together? – E.Richards Feb 27 '17 at 19:16
  • I don't believe it would be possible to keep the rows (instances) together, and at the same time have different row probabilities associated with each column. If you just want different row probabilities, then the solution is actually simpler than presented above. Using your original probabilities, you could do `p <- c(rep(0.1,5),rep(0.9,195))`, then `x[sample(nrow(x),size=114,replace=TRUE,prob=p),]` – Matt Tyers Feb 27 '17 at 19:24
  • Thanks Matt, that is logical. However, my question is more whether it is possible to tie the probabilities you stated, `c(rep(0.1,5),rep(0.9,195))`, to a specific column and keep the rows together. For example, could you assign this probability to column 4 for the `sample` function? Thanks for your help. – E.Richards Feb 28 '17 at 13:03
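
One possible reading of the request in this thread is to keep whole rows together and let each row's weight depend on its value in a single column. The sketch below does that under stated assumptions: the data frame is a made-up stand-in with 114 rows and 16 columns, column 5 holds hypothetical values between 1 and 200, a value of 5 or less counts as the "low" group, and the 0.1 and 0.9 are treated as the total probability of each group. The names `is_low`, `w`, `idx` and `boot` are illustrative. Note that `prob` needs exactly one entry per row, so its length must equal `nrow(x)`; otherwise `sample` stops with an error.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Hypothetical stand-in for the real data: 114 rows, 16 columns.
# Column 1 ends up holding the row number, which makes it easy
# to verify later that rows were kept intact.
n <- 114
x <- as.data.frame(matrix(seq_len(n * 16), nrow = n))
x[[5]] <- rep_len(c(2, 4, 50, 120, 200, 75), n)  # made-up values in 1..200

# One weight per row, decided by that row's value in column 5.
# The low group shares a total probability of 0.1, the rest 0.9.
is_low <- x[[5]] <= 5
w <- ifelse(is_low, 0.1 / sum(is_low), 0.9 / sum(!is_low))

# Sample row indices with those weights, then take whole rows
idx <- sample(nrow(x), size = nrow(x), replace = TRUE, prob = w)
boot <- x[idx, ]

# Share of resampled rows whose column 5 value is in the low group
mean(boot[[5]] <= 5)
```

If the 0.1 and 0.9 are instead meant as relative weights per row, `ifelse(is_low, 0.1, 0.9)` would be the analogous weight vector, since `sample` rescales `prob` to sum to one. Either way the weights apply to rows; the column only decides which weight each row receives.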