
I have a data set from which I would like to take stratified samples, build statistical models using the caret package, and then generate predictions.

The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be partly due to the relatively small sample size, M = 1000).

What I want to be able to do is:

  1. Generate the stratified data sample
  2. Create the machine learning model
  3. Repeat 1000 times & take the average model output

I hope that by repeating these steps on variations of the stratified data set, I can avoid the subtle changes in the predictions that arise from the small sample size.

For example, it might look something like this in R:

Original.Dataset <- data.frame(A)

Stratified.Dataset <- stratified(Original.Dataset, group = x)

Model <- train(Stratified.Dataset, ...)  # ...other model inputs

Repeat the process with a new stratified data set based on the original data, and average the results out.
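
To make the idea concrete, here is a rough sketch of the loop I have in mind. I am assuming stratified() from the splitstackshape package; y ~ ., New.Data and the 0.8 sampling fraction are placeholders, not my actual inputs:

library(splitstackshape)  # provides stratified()
library(caret)

# Resample, fit, predict, repeat, then average the predictions
preds <- replicate(1000, {
  Stratified.Dataset <- stratified(Original.Dataset, group = "x", size = 0.8)
  Model <- train(y ~ ., data = Stratified.Dataset, method = "lm")
  predict(Model, newdata = New.Data)
})
Averaged.Predictions <- rowMeans(preds)  # mean prediction per observation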

Thank you in advance for any help or package suggestions that might be useful. Is it possible to do the stratified sampling, or this kind of simulation, within caret?

Luigi Biagini
JFG123
  • What do you mean with: `The problem I am finding is that in different iterations of the stratified data set I get significantly different results` ? – nadizan Feb 20 '18 at 10:11
  • So for example, if I do set.seed(1), create the data set and run the analysis, and then repeat with set.seed(2), create the data set and run the exact same analysis, my predictions are significantly different. What I want to do is aggregate the results by running the analysis on stratified samples multiple times to 'smooth' over these discrepancies. – JFG123 Feb 20 '18 at 10:30
  • @JackFahey-Gilmour what you are describing is ensemble learning, and in particular, random forest. At its core is the fact that **your results will be significantly different in each sampling iteration**. That is the source of the innate power of ensemble models. Please read up on ensemble learning, or random forest in particular, and things will be much clearer to you. Your question is way too broad right now. – FatihAkici Feb 21 '18 at 02:25

1 Answer


First of all, welcome to SO.

It is hard to understand exactly what you are wondering; your question is very broad.

If you need input on the statistics, I would suggest asking more clearly defined questions on Cross Validated, the Q&A site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.

The problem I am finding is that in different iterations of the stratified data set I get significantly different results (this may be partly due to the relatively small sample size, M = 1000).

I assume you are referring to different iterations of your model. This depends on how large your different groups are. E.g., if you are trying to divide a data set of 1000 samples into groups of 10 samples, your model could very likely be unstable and hence give different results in each iteration. This could also be because your model depends on some randomness: the smaller your data is (and the more groups you have), the larger the variation. See here or here for more information on cross validation, stability and bootstrap aggregating.
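
As a minimal illustration of bootstrap aggregating (purely a sketch, using base R's lm on the built-in mtcars data rather than your actual model):

# Fit the model on 100 bootstrap resamples and average the predictions
set.seed(1)
boot_preds <- sapply(1:100, function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)  # bootstrap resample
  fit <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
  predict(fit, newdata = mtcars)               # predict on the full data
})
bagged <- rowMeans(boot_preds)  # 'bagged' predictions, one per row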

  1. Generate the stratified data sample

How to generate it: the dplyr package is excellent for grouping data by different variables. You might also want to use the split function from the base package. See here for more information. You could also use the built-in methods in the caret package, found here.
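
For example, caret's createDataPartition creates stratified random splits based on the outcome, preserving the group proportions (shown here on the built-in iris data):

library(caret)

# Indices for a 70% training split, sampled within each class of Species
set.seed(1)
in_train <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]
table(training$Species)  # roughly 70% of each class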

How to know how to split it: it depends very much on the question you would like to answer; most likely you would like to even out some variables, e.g. gender and age when creating a model for predicting disease. See here for more info.
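
A small dplyr sketch of this, sampling the same fraction within each stratum (iris$Species standing in for a grouping variable such as gender):

library(dplyr)

# Keep 50% of the rows within every stratum
set.seed(1)
strat_sample <- iris %>%
  group_by(Species) %>%
  sample_frac(0.5) %>%
  ungroup()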

In the case of having e.g. duplicated observations, where you want to create unique subsets containing different combinations of the replicates with their unique measurements, you will have to use other methods. Suppose the replicates have a common identifier, here sample_names. You could then do something like this to select all samples, but with different combinations of the replicates:

# Two replicates per sample, identified by sample_names
tg <- data.frame(sample_names = rep(1:5, each = 2))
set.seed(10)
tg$values <- rnorm(10)

# 100 partitions; each randomly picks one of the two replicates per sample
partition <- lapply(1:100, function(z) {
  set.seed(z)
  sapply(unique(tg$sample_names), function(x) {
    which(x == tg$sample_names)[sample(1:2, 1)]
  })
})

# The first partition of your data, to train a model on
tg[partition[[1]], ]

  2. Create the machine learning model

If you want to use caret, you could go to the caret webpage and see all the available models. Depending on your research question and/or data, you will want to use different types of models. Therefore, I would recommend taking some online machine learning courses, for instance the Stanford University course given by Andrew Ng (I have taken it myself), to get more familiar with the major algorithms. If you are already familiar with the algorithms, just search the available models.
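
As a minimal, purely illustrative example (a linear model on the built-in iris data with 10-fold cross validation):

library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Sepal.Length ~ Petal.Length + Petal.Width,
             data = iris, method = "lm", trControl = ctrl)
fit$results  # resampled RMSE and R-squared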

  3. Repeat 1000 times & take the average model output

You can repeat your model 1000 times with different seeds (see set.seed) and/or different training methods, e.g. cross validation or bootstrap aggregation. There are a lot of different training parameters in the caret package:

The function trainControl generates parameters that further control how models are created, with possible values:

method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob"

For more information on the methods, see here.
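
Tying this back to the partitions from step 1, you could pass them to trainControl via the index argument, so that each partition is used as one training resample (a sketch; you may also need to set indexOut explicitly, see the comments below):

library(caret)

# Each element of `partition` becomes the training rows of one resample;
# by default the remaining rows are held out for evaluation
tc <- trainControl(method = "cv", index = partition, returnResamp = "all")
model_tg <- train(values ~ sample_names, data = tg,
                  method = "lm", trControl = tc)
head(model_tg$resample)  # per-resample error metrics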

nadizan
  • Hi @nadizan, thanks for your response. I do not think I was clear enough before. I have been using R and caret for a fair while now but have stumbled across this data set. I will explain it further here. My data set has examples where if one training example is included then another cannot be. Therefore some of the training examples are pairs, where only one can exist in the training data set. What I find is that when I use different combinations of the data set (with only one of the pairs present) I get different results. – JFG123 Feb 21 '18 at 04:48
  • What I therefore want to do is do the stratification of the data set (so only 1 of the pairs is included in training) and then run the model on this. But for every repeat of the model use a different stratification of the data set. Is this possible to do inside of caret? That is my main question, can caret do stratification (with certain conditions such as this) inside its processing? Once again thanks for the help. #firsttimeuser – JFG123 Feb 21 '18 at 04:50
  • See updated answer, did I understand your question correctly? – nadizan Feb 21 '18 at 09:36
  • Hi @nadizan, yes! That is a really good way of creating the various training data sets. How would I then train a model on all of these different versions of the training data set? Is it possible to embed this formula into caret so that the model can be trained on all of the different versions and aggregated to give a final model output? Thanks for your time. – JFG123 Feb 21 '18 at 13:54
  • @JackFahey-Gilmour you could use the `trainControl` function in `caret` and add `index=partition`, so you have something like: `trainControl(method = "cv", index=partition, returnResamp="all")`. Then `model_iris<-train(Sepal.Length~Petal.Length+Petal.Width, data=iris,method = "lm", trControl = tmp)`. In the resulting `train` object you will find a list called `model_iris$resample` with the errors and metrics, BUT you will not get the coefficients. See the answer from the author of `caret` here: https://stackoverflow.com/questions/28303509/caret-coefficients-of-cross-validated-set. – nadizan Feb 21 '18 at 15:55
  • @JackFahey-Gilmour also, please accept (and upvote) my answer if you find it informative and helpful so other people can find this solution. – nadizan Feb 21 '18 at 15:56
  • Really good suggestion, I can see how it works. I have a question regarding your comment 'index=partition': this works, but if I do not set indexOut it will use 'the unique set of samples not contained in index' according to the caret documentation. Is there a way to specify that I want the cross validation to all occur in the one re-sample? I.e. the held-out set for CV comes from the re-sampled group. Otherwise, if indexOut is not set, there is the potential for the duplicates to exist in the validation set during each CV iteration. That's how I interpreted it? – JFG123 Feb 21 '18 at 17:07
  • @JackFahey-Gilmour Yes, you interpreted it correctly. You will have to specify the samples in `indexOut`. An alternative would be to perform a different `train` for each partition, e.g. in a for loop, and then save all the coefficients/parameters. That would probably be the easiest way to go. Once again, click on my main post and upvote and/or accept the answer if you find it helpful. More info here: https://stackoverflow.com/help/someone-answers Good luck! – nadizan Feb 21 '18 at 22:22