First of all, welcome to SO.
It is hard to understand exactly what you are asking; your question is very broad. If you need input on statistics, I would suggest asking more clearly defined questions on Cross Validated:
> Q&A for people interested in statistics, machine learning, data analysis, data mining, and data visualization.
> The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be in part due to the relatively small data sample M=1000).
I assume you are referring to different iterations of your model. This depends on how large your different groups are: e.g. if you try to divide a data set of 1000 samples into groups of 10 samples, your model could very likely be unstable and hence give different results in each iteration. This could also be because your model depends on some randomness; the smaller your data is (and the more groups you have), the larger the variation. See here or here for more information on cross validation, stability and bootstrap aggregating.
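As an illustration, here is a minimal sketch with made-up data (the model, sample size and number of splits are arbitrary choices for demonstration, not taken from your question) of how out-of-sample error can swing between random splits of a small sample:

    set.seed(1)
    d <- data.frame(x = rnorm(50), y = rnorm(50))

    # refit and evaluate the same simple model on 20 different random splits
    rmse <- sapply(1:20, function(s) {
      set.seed(s)
      idx <- sample(nrow(d), 0.8 * nrow(d))                  # 80% training split
      fit <- lm(y ~ x, data = d[idx, ])
      sqrt(mean((d$y[-idx] - predict(fit, d[-idx, ]))^2))    # test RMSE
    })
    range(rmse)  # the spread across seeds illustrates the instability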
> - Generate the stratified data sample
How to generate it: the `dplyr` package is excellent for grouping data depending on different variables. You might also want to use the `split` function found in the `base` package; see here for more information. You could also use the built-in methods found in the `caret` package, found here.
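For instance, a tiny base R sketch of `split` (using the built-in `iris` data purely as a stand-in for your own data):

    # split a data frame into a list of data frames, one per group
    s <- split(iris, iris$Species)
    str(s, max.level = 1)  # three data frames, one per species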
How to know how to split it: it depends very much on the question you would like to answer; most likely you would like to even out some variables, e.g. gender and age, when creating a model for predicting disease. See here for more info.
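For stratified splits, `caret::createDataPartition` samples within each level of the outcome, so class proportions are preserved in both subsets. A minimal sketch with made-up data (the outcome name `disease` and the class sizes are assumptions):

    library(caret)
    set.seed(42)
    df <- data.frame(disease = factor(rep(c("yes", "no"), c(300, 700))),
                     age     = rnorm(1000, mean = 50, sd = 10))
    idx <- createDataPartition(df$disease, p = 0.8, list = FALSE)
    train_set <- df[idx, ]
    test_set  <- df[-idx, ]
    prop.table(table(train_set$disease))  # class balance is preserved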
If you have, e.g., duplicated observations and you want to create unique subsets with different combinations of the replicates, each with its unique measurements, you will have to use other methods. If the replicates have a common identifier (here `sample_names`), you could do something like this to select all samples but with different combinations of the replicates:
    tg <- data.frame(sample_names = rep(1:5, each = 2))
    set.seed(10)
    tg$values <- rnorm(10)

    # for each of 100 seeds, pick one of the two replicates per sample at random
    partition <- lapply(1:100, function(z) {
      set.seed(z)
      sapply(unique(tg$sample_names), function(x) {
        which(x == tg$sample_names)[sample(1:2, 1)]
      })
    })

    # the first partition of your data, e.g. to train a model
    tg[partition[[1]], ]
> - Create the machine learning model
If you want to use `caret`, you could go to the caret webpage and see all the available models. Depending on your research question and/or data, you will want to use different types of models. Therefore, I would recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search the list of available models.
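As a minimal sketch of fitting a model with `caret::train` (the data are made up, and `"glm"` is just one of the many model tags you could pass to `method`):

    library(caret)
    set.seed(1)
    df <- data.frame(y  = factor(sample(c("a", "b"), 100, replace = TRUE)),
                     x1 = rnorm(100),
                     x2 = rnorm(100))
    fit <- train(y ~ ., data = df,
                 method    = "glm",  # swap in any tag from the caret model list
                 trControl = trainControl(method = "cv", number = 5))
    fit  # resampled accuracy/kappa across the folds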
> - Repeat 1000 times & take the average model output
You can repeat your model 1000 times with different seeds (see `set.seed`) and different training methods, e.g. cross validation or bootstrap aggregation. There are a lot of different training parameters in the `caret` package:
> The function `trainControl` generates parameters that further control how models are created, with possible values:
>
> method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"
For more information on the methods, see here.
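For example, a sketch of repeated cross-validation, where `caret` averages the resampled performance for you (the data are made up, and 10 folds × 100 repeats = 1000 resamples is an arbitrary choice to match your "repeat 1000 times" step):

    library(caret)
    # repeat 10-fold cross-validation 100 times; caret averages the resamples
    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 100)
    set.seed(123)
    df  <- data.frame(y = rnorm(100), x = rnorm(100))
    fit <- train(y ~ x, data = df, method = "lm", trControl = ctrl)
    fit$results  # RMSE etc., averaged over all 1000 resamples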