Training and test set with respect to group affiliation

Question

I’m using the following function in R to split subjects/samples into a training set and a test set, and it works well. However, the subjects in my dataset belong to two groups (patients and control subjects), so I want to split the data while keeping the ratio of patients to control subjects in both the training and test sets the same as in the complete dataset. How can I do that in R? How can I modify the following function so that it takes group affiliation into account when it splits the data into training and test sets?

# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed=NULL) { 
  if (!is.null(seed)) 
     set.seed(seed)

  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ] 
  testset <- dataframe[-trainindex, ] 
  list(trainset=trainset,testset=testset) 
}

# apply the function
splits <- splitdf(Data, seed=808)

# it returns a list - two data frames called trainset and testset
str(splits)    

# there are "n" observations in each data frame
lapply(splits,nrow)   

# view the first few rows in each data frame
lapply(splits,head)   

# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset


Example: use the built-in iris data and split the dataset into training and testing sets. This dataset has 150 samples and a factor called Species with 3 levels (setosa, versicolor and virginica).

Load the iris data:

data(iris)

Split the dataset into training and testing sets:

splits <- splitdf(iris, seed=808)

str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset

As you can see, the function “splitdf” does not take the group affiliation “Species” into account when it splits the data into training and test sets. As a result, the numbers of setosa, versicolor and virginica samples in the training and test sets are not proportional to those in the full dataset. So, how can I modify the function so that it takes group affiliation into account when it splits the data into training and test sets?
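The imbalance can be checked directly by tabulating Species in each set. The following is a minimal sketch that repeats the `splitdf` definition from above so it runs on its own; the exact counts depend on the seed and the R version, so no output is shown.

```r
# splitdf() exactly as defined in the question: a plain random 50/50 split
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)

  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index) / 2))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset = trainset, testset = testset)
}

data(iris)
splits <- splitdf(iris, seed = 808)

# counts per species in each set; a stratified split would give 25 of each
table(splits$trainset$Species)
table(splits$testset$Species)

# proportions in the full dataset, for comparison (one third per species)
prop.table(table(iris$Species))
```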

You should give a reproducible example and the expected output. – agstudy Sep 22 '13 at 10:37
@agstudy Should I upload the head of the dataset to show what the dataset looks like? – Payam Delfani Sep 22 '13 at 10:43
No. (I did not downvote.) Maybe you should read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – agstudy Sep 22 '13 at 10:48
@agstudy Thanks for clarifying that point. I have now edited the question and added an example to demonstrate the problem. I hope that someone can help me out! – Payam Delfani Sep 22 '13 at 11:20

Nate Pope · Accepted Answer · 2013-09-22T19:21:13.073


Here is a solution using plyr with a simulated dataset.

library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10),
                 treatment = sample(c("control", "control", "treatment"), 100, replace = TRUE))

# divide data set into training and test sets
tr_prop = 0.5    # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)

# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets

Here, I use set.seed() to ensure identical behavior of sample() when constructing the training/test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result using a single call to **ply (but returning two dataframes). Another option (without egregious use of set.seed) would be to use dlply and then piece together elements of the resulting list into training/test sets:

set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) { s = sample(1:nrow(.), trunc(nrow(.) * tr_prop)); list(.[s, ], .[-s,]) } )
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
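For comparison, the same stratified split can be sketched in base R as a variant of the asker's `splitdf`, without plyr. This is only an illustration under stated assumptions: the function name `splitdf_strat` and the arguments `group` and `prop` are hypothetical and do not come from the original function.

```r
# Hypothetical stratified variant of splitdf(); 'group' (name of the grouping
# column) and 'prop' (training proportion) are illustrative argument names.
splitdf_strat <- function(dataframe, group, prop = 0.5, seed = NULL) {
  if (!is.null(seed))
    set.seed(seed)

  # row indices of the data frame, split by level of the grouping column
  idx_by_group <- split(seq_len(nrow(dataframe)), dataframe[[group]])

  # within each group, draw trunc(prop * n) of that group's row indices
  trainindex <- unlist(lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), trunc(length(idx) * prop))]
  }), use.names = FALSE)

  # everything not drawn for training goes to the test set
  testindex <- setdiff(seq_len(nrow(dataframe)), trainindex)

  list(trainset = dataframe[trainindex, , drop = FALSE],
       testset  = dataframe[testindex, , drop = FALSE])
}

splits <- splitdf_strat(iris, group = "Species", seed = 808)
table(splits$trainset$Species)  # 25 of each species
table(splits$testset$Species)   # 25 of each species
```

Because the rows are selected by ordinary data frame indexing, the row names of the input carry over to both sets, and the seed is set only once, before any sampling.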

edited Sep 22 '13 at 19:21

answered Sep 22 '13 at 18:42


Thanks for the answer, I think it works fine. However, for some reason the rownames of my data set are replaced with numbers, and as a result I don't know which samples are used in each training and test set. Do you have any idea why that happens? Thanks. – Payam Delfani Sep 22 '13 at 21:24
I believe this happens because `ddply` creates an entirely new dataframe with the results of the function. I would just add a new column with your sample IDs to the original data, and pass to `ddply` or `dlply` as above. – Nate Pope Sep 22 '13 at 22:07
Thanks. But the original data contains sample IDs (S01-S100), and they are replaced by numbers when I pass the data to ddply. – Payam Delfani Sep 22 '13 at 22:27
Try converting the column of sample IDs to a character vector using `as.character` prior to passing to `ddply`. If they are passed as factors, they may get converted to their numeric equivalent (i.e. "S01" could end up as "1") – Nate Pope Sep 22 '13 at 22:34
The sample IDs are a character vector. I read in my data set as follows: `dat<- read.table(file="myData.txt", sep="\t", row.names=1, header=TRUE)`. I checked that with `is.character(rownames(dat))` and the answer is TRUE. – Payam Delfani Sep 22 '13 at 22:45
My initial suggestion was to create a separate column for the rownames, ie.: `dat$sample_id <- rownames(dat)` and then pass `dat` to `ddply`. The *rownames* of `dat`, which are a property of a data frame, are lost in a call to `ddply`. This works fine for me. If you want these IDs as rownames in your test/training sets, you can always use `rownames(training_set) <- training_set$sample_id` – Nate Pope Sep 22 '13 at 22:51
When I construct the training and test sets using the two different strategies (ddply and dlply) you described above, I can't reproduce the same output, even though I set the seed to 101 in both cases. Do you have any idea why that happens? Thanks – Payam Delfani Sep 26 '13 at 12:49
@PayamDelfani Yes, it is because `set.seed` is *within* the function in the `ddply` method, and outside the function in the `dlply` method. So with `ddply`, the same seed is set *each* time `ddply` splits the data frame before the call to `sample`. To make the first example consistent with the second, I believe you could move `set.seed` from within each call to `ddply`, to right before each call to `ddply`. – Nate Pope Sep 26 '13 at 17:10
@Nate Pope, Thanks for clarifying that point. It seems to me that the ddply approach is a bit dangerous, because failing to set the seed before each call to ddply will result in some samples being used in both the training and test sets. In other words, the second example, which uses dlply, is safer than the ddply one. – Payam Delfani Sep 27 '13 at 11:27
@Nate Pope, By the way, what's the difference between these two conditions: setting the seed outside the function and setting it within the function? Don't we set a seed in both cases anyhow? Why should we expect a different output? – Payam Delfani Sep 27 '13 at 11:54
@PayamDelfani I agree that `dlply` as I have used it here is safer, although as long as the same seed is passed to the `ddply` example, it should work fine (the only reason to pull the seed outside of the function is to check for consistency with `dlply`). – Nate Pope Sep 27 '13 at 15:10
@PayamDelfani When we `set.seed` inside the function given to `ddply`, the seed is set twice (in this example) for each call to `ddply` -- twice for the test set and twice for the training set. First, when the data are subset to `treatment` and second when the data are subset to `control`. When we `set.seed` *before* the call to `ddply`, we are setting a seed only once. – Nate Pope Sep 27 '13 at 15:12
Thank you for clarifying that point! By the way, I've posted a new question regarding "Nested Loop Cross Validation for Classification using nlcv package". Please let me know if you have any suggestions or recommendations on any aspect of this issue. – Payam Delfani Sep 30 '13 at 21:26
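The difference between seeding inside and outside the per-group function, discussed in the comments above, can be seen in a small base R sketch. Here `lapply` over two equally sized dummy groups stands in for the per-group calls made by `ddply`/`dlply`; that substitution is an assumption made for illustration, not the plyr internals.

```r
# Seed set INSIDE the per-group function: the random stream restarts for
# every group, so equally sized groups receive the same draw.
inside <- lapply(1:2, function(g) {
  set.seed(101)
  sample(1:10, 5)
})
identical(inside[[1]], inside[[2]])   # TRUE

# Seed set ONCE, before the loop: the first group gets the same draw as
# above, but the second group continues the random stream from there.
set.seed(101)
outside <- lapply(1:2, function(g) sample(1:10, 5))
identical(inside[[1]], outside[[1]])  # TRUE

# the second group's draw now generally differs from the first group's
outside[[2]]
```

This is why the two approaches agree on the first group but not on later ones, and why the two `ddply` calls stay complementary as long as they receive the same seed.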
