200

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)

I have a dataset that I need to split into a training (75%) and testing (25%) set. I'm not sure what I'm supposed to put in for x and size. Is x the dataset file, and size the number of samples I have?

Wael
Susie Humby
  • `x` can be the index (row/col nos. say) of your `data`. `size` can be `0.75*nrow(data)`. Try `sample(1:10, 4, replace = FALSE, prob = NULL)` to see what it does. – harkmug Jun 19 '13 at 20:09

28 Answers

314

There are numerous approaches to data partitioning. For a more complete approach, take a look at the createDataPartition function in the caret package.

Here is a simple example:

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
robertspierre
dickoa
  • I'm a little confused: what guarantees that this code returns unique test and train dfs? It seems to work, don't get me wrong. I'm just having trouble understanding how subtracting the indices leads to unique observations. For instance, if you had a df with 10 rows and one column, and the one column contained 1,2,3,4,5,6,7,8,9,10 and you followed this code, what prevents train having index 4 and test having -6 -> 10 - 6 = 4 as well? – goldisfine May 05 '14 at 13:09
  • Thanks. I tried `mtcars[!train_ind]` and while it didn't fail, it didn't work as expected. How could I subset using the `!`? – user989762 Apr 23 '15 at 07:22
  • @user989762 `!` is used for logicals (`TRUE/FALSE`) and not indices. If you want to subset using `!`, try something like `mtcars[!seq_len(nrow(mtcars)) %in% train_ind, ]` (not tested). – dickoa Apr 23 '15 at 16:09
  • @VedaadShakib when you use "-" it omits all the indices in train_ind from your data. Take a look at http://adv-r.had.co.nz/Subsetting.html. Hope it helps – dickoa Aug 01 '16 at 22:05
  • Indexing by `-train_ind` doesn't do what he claims. The minus does an elementwise negation of the numbers rather than selecting for those indices in `train` not picked by `train_ind`. Not sure how this has 93 upvotes since train and test are not split sets of mtcars. Wasted half an hour on this and had to find a different solution. Maybe it's a special legacy artifact that's since changed? – Eric Leschinski Mar 30 '17 at 03:38
  • @EricLeschinski Can you elaborate with some code or a gist on GitHub? There are probably other, more elegant ways to do it. It is a 3-year-old answer, so a few things have probably changed, but I don't really get what you are saying. Can you post it in a gist so that I can understand better why negative indexing doesn't work in this case? If you have a better (universal) way, I would gladly edit my answer. Best – dickoa Mar 31 '17 at 16:23
  • @ChristopherJohn Can you explain why it doesn't work, with an example in a gist for instance? I really want to understand, to improve this answer from 2013. Thanks – dickoa Aug 05 '17 at 15:23
  • @ChristopherJohn Thanks for your comment. I am indexing by numbers, not row names; I don't understand why you want to use row names instead of integers to index a dataframe (http://www.perfectlyrandom.org/2015/06/16/never-trust-the-row-names-of-a-dataframe-in-R/). On the point about the size of the data, I don't get why it would not work on a large dataset; is there any reproducible example you can share with me? I fail to understand, but I really want to improve this answer. Lastly, if you want me to remove the floor call, what would you replace it with (ceiling? round?). Cheers – dickoa Aug 08 '17 at 15:20
  • I have deleted my comments, I cannot reproduce the problem I was having previously. Apologies. Thanks. – Christopher John Aug 10 '17 at 13:14
  • @ChristopherJohn Thanks for the comment and I will look into ways to improve this answer, probably using the new modelr package (https://github.com/tidyverse/modelr). – dickoa Aug 10 '17 at 19:06
  • What if we need a 5-fold or 10-fold validation? How do we make non-overlapping test sets? – akang Jan 09 '18 at 02:21
  • This is the best solution. – Jinhua Wang Sep 18 '19 at 09:51
  • Isn't `createDataPartition` in `caret` and not `caTools`? – J. Mini Apr 30 '20 at 20:12
  • Can someone explain to me why we set the seed? What does it do? – coder_bg Jan 13 '22 at 13:40
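As an aside on the seed question in the comment above: set.seed fixes the state of R's random number generator, so the same call to sample draws the same rows on every run. A minimal illustrative sketch:

```r
data(mtcars)

# with the same seed, sample() picks exactly the same rows each time
set.seed(123)
ind1 <- sample(seq_len(nrow(mtcars)), size = 5)

set.seed(123)
ind2 <- sample(seq_len(nrow(mtcars)), size = 5)

identical(ind1, ind2)  # TRUE: the partition is reproducible
```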
119

It can be easily done by:

set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now Selecting 75% of data as sample from total 'n' rows of the data  
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
test  <- data[-sample, ]

By using caTools package:

require(caTools)
set.seed(101) 
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)
TheMI
  • I recently did a course with MIT and they used the approach using caTools throughout. Thanks – Chetan Sharma Oct 26 '17 at 08:42
  • `sample = sample.split(data[,1], SplitRatio = .75)` should remove the need to name a column. – Benjamin Ziepert Jun 04 '20 at 20:26
  • https://github.com/cran/caTools/blob/master/R/sample.split.R "Split data from vector Y into 2 bins in predefined ratio while preserving relative [ratios] of different labels in Y." So if this is for a classification problem, the first parameter should be the column with the classes to be predicted. If it's a regression problem, you might as well use the built-in sample function as in the former solution. – Shri Samson Oct 21 '20 at 07:58
42

I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add one if your data doesn't already contain it.

mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')
Edwin
31

This is almost the same code, but with a nicer look:

bound <- floor((nrow(df)/4)*3)         #define % of training and test set

df <- df[sample(nrow(df)), ]           #sample rows 
df.train <- df[1:bound, ]              #get training set
df.test <- df[(bound+1):nrow(df), ]    #get test set
Katerina
  • Yup! Nice look! – MS Sankararaman Sep 25 '18 at 09:50
  • Does this randomly pick data? Is the `sample` method used built in? – Regressor Sep 23 '20 at 16:11
  • Yes, it does, at the step `df <- df[sample(nrow(df)), ]`, where all the rows are randomly sampled. Take a look at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/sample for more information on the `sample` function, which is an R base function. – Spacez Mar 13 '23 at 17:26
24
library(caret)
intrain <- createDataPartition(y = sub_train$classe, p = 0.7, list = FALSE)
training <- sub_train[intrain, ]
testing  <- sub_train[-intrain, ]
pradnya chavan
  • While a code-only answer is an answer, it is better to provide some explanation. – C8H10N4O2 Feb 12 '16 at 01:39
  • What is m_train? I think you meant sub_train, the original data.frame. Therefore, the revised code should be training<-sub_train[intrain,] and testing<-sub_train[-intrain,]. I wonder why nobody was able to spot this major problem with your answer in the past five years! – mnm Jul 27 '16 at 01:30
  • OP didn't explicitly ask for stratification... – Johannes Wiesner Dec 01 '21 at 10:52
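For the k-fold question raised in an earlier comment, caret also provides createFolds, which produces non-overlapping folds; a minimal sketch:

```r
library(caret)
data(mtcars)

set.seed(123)
# returnTrain = FALSE returns the held-out (test) row indices for each fold
folds <- createFolds(mtcars$mpg, k = 5, list = TRUE, returnTrain = FALSE)

# the folds partition the rows: each row appears in exactly one test set
all_idx <- sort(unlist(folds))
```

Each element of folds can then be used to index a test set, with the remaining rows as training.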
22

I will split 'a' into train (70%) and test (30%):

    a # original data frame
    library(dplyr)
    train<-sample_frac(a, 0.7)
    sid<-as.numeric(rownames(train)) # because rownames() returns character
    test<-a[-sid,]

done

hyunwoo jeong
  • You need to load the dplyr package: require(dplyr) – TheMI Jan 28 '16 at 06:20
  • This answer helped me but I did need to tweak it to get expected results. As is, the dataset 'train' has rownames = sid of sequential integers: 1,2,3,4,... whereas you want sid to be the rownumbers from the original dataset 'a,' which since they are randomly selected won't be the sequential integers. So, it's necessary to create the id variable on 'a' first. – Scott Murff Oct 21 '16 at 02:34
  • row.names(mtcars) <- NULL; train<-dplyr::sample_frac(mtcars, 0.5); test<-mtcars[-as.numeric(row.names(train)),] # I did this to my data, the original code doesn't work if your row names are set to numbers already – Christopher John Aug 04 '17 at 07:56
  • It looks ok to me, but it does not seem to be a random selection – B_slash_ Oct 19 '22 at 15:35
17

My solution is basically the same as dickoa's but a little easier to interpret:

data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]
Morgan Ball
Alex
15

I can suggest using the rsample package:

library(rsample)

# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate dataframes
data_train <- training(data_split)
data_test  <- testing(data_split)
10

After looking through all the different methods posted here, I didn't see anyone utilize TRUE/FALSE to select and unselect data. So I thought I would share a method utilizing that technique.

n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))

training = dataset[split, ]
testing = dataset[!split, ]

Explanation

There are multiple ways of selecting data from R, most commonly people use positive/negative indices to select/unselect respectively. However, the same functionalities can be achieved by using TRUE/FALSE to select/unselect.

Consider the following example.

# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)


# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5

# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5

# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5
Joe
7

A briefer and simpler way, using the awesome dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]
Union find
Shayan Amani
6

The scorecard package has a useful function for this, where you can specify the ratio and the seed:

library(scorecard)

dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test.

camnesia
5

If you type:

?sample

It will open a help page explaining what the parameters of the sample function mean.

I am not an expert, but here is some code I have:

data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[1]]
train <- rbind(splitdata[[2]], splitdata[[3]], splitdata[[4]])

This will give you 75% train and 25% test.

navyjeff
user2502836
4

My solution shuffles the rows, then takes the first 75% of the rows as train and the last 25% as test. Super simples!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train <- orders_pivotted[head(shuffled_rows,floor(row_count*0.75)),]
test <- orders_pivotted[tail(shuffled_rows, row_count - floor(row_count*0.75)),]
Johnny V
3

We can divide the data into a particular ratio; here it is 80% train and 20% test:

ind <- sample(2, nrow(dataName), replace = T, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]
Adarsh Pawar
  • Be careful, as this is based on probability, so using this will not always create an exact 80-20% split. Easiest is just to use `createDataPartition` from the `caret` package as mentioned above. For instance, after just running it 3 times consecutively I got 45 train samples and just 5 test out of 50 (90-10). – user21398 Jan 14 '21 at 01:16
2

Below is a function that creates a list of sub-samples of the same size, which is not exactly what you wanted but might prove useful for others. In my case, I used it to create multiple classification trees on smaller samples to test for overfitting:

df_split <- function (df, number){
  sizedf      <- length(df[,1])
  bound       <- sizedf/number
  list        <- list() 
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}

Example :

x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,]    1
# [2,]    2
# [3,]    3
# [4,]    4
# [5,]    5
# [6,]    6
# [7,]    7
# [8,]    8
# [9,]    9
#[10,]   10

x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2

# [[2]]
# [1] 3 4

# [[3]]
# [1] 5 6

# [[4]]
# [1] 7 8

# [[5]]
# [1] 9 10
Yohan Obadia
2

Using the caTools package in R, sample code would be as follows:

library(caTools)
split = sample.split(data$DependentColumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
Yash Sharma
2

Use base R. The runif function generates uniformly distributed values from 0 to 1. By varying the cutoff value (train.size in the example below), you will always have approximately the same percentage of random records below the cutoff value.

data(mtcars)
set.seed(123)

#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size

#train
train.df<-mtcars[train.ind,]


#test
test.df<-mtcars[!train.ind,]
  • This would be a much better answer if it showed the extra couple lines to actually create the training and test sets (which newbies often struggle with). – Gregor Thomas Jan 26 '18 at 14:55
2
require(caTools)

set.seed(101)            # This is used to create the same samples every time

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)

The sample.split() function returns a logical vector split1 in which 2/3 of the entries are TRUE and the others FALSE. The rows where split1 is TRUE are copied into train, and the other rows are copied into the test dataframe.

thor
Abhishek
2

Assuming df is your data frame, and that you want to create 75% train and 25% test

all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]

Then to create a train and test data frames

df_train <- df[train_i,]
df_test <- df[test_i,]
Marcello B.
Corentin
1

Beware of using sample for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by location would yield different results because now 5 through 10 have all moved places.

An alternative method is to use a hash function to map IDs into some pseudo random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.

For example:

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))

[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

# to fix that, we can use some hash function to sample on the last digit

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

The sample size is not exactly 5000 because assignment is probabilistic, but it shouldn't be a problem in large samples thanks to the law of large numbers.

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo

dzeltzer
  • Added as a separate question: https://stackoverflow.com/questions/52769681/reproducible-splitting-of-data-into-training-and-testing-in-r – dzeltzer Oct 11 '18 at 22:34
  • I want to develop auto.arima model from multiple time series data and I want to use 1 year of data, 3 year of data, 5, 7... in a two year interval from each series to build the model and testing it in the remaining testing set. How do I do the subsetting so that the fitted model will have what I want? I appreciate for your help – Stackuser Apr 09 '20 at 04:35
1

I bumped into this one, it can help too.

set.seed(12)

# the Sonar data set used here comes from the mlbench package
library(mlbench)
data(Sonar)

data = Sonar[sample(nrow(Sonar)), ]  # reshuffles the data
bound = floor(0.7 * nrow(data))
df_train = data[1:bound, ]
df_test = data[(bound + 1):nrow(data), ]
user322203
1

Create an index column "rowid" and use an anti join to filter with by = "rowid". You can remove the rowid column by using %>% select(-rowid) after the split.

library(dplyr)

data <- tibble::rowid_to_column(data)

set.seed(11081995)

testdata <- data %>% slice_sample(prop = 0.2)

traindata <- anti_join(data, testdata, by = "rowid")

0
set.seed(123)
llwork <- sample(1:nrow(mydata), round(0.75*nrow(mydata), digits = 0))
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]
0

I think this would solve the problem:

df = read.csv("data.csv")
# Split the dataset into 80-20
numberOfRows = nrow(df)
bound = as.integer(numberOfRows * 0.8)
train = df[1:bound, ]
test1 = df[(bound + 1):numberOfRows, ]
0

I prefer using dplyr to mutate the values

set.seed(1)
mutate(x, train = runif(n()) < 0.75)

I can keep using dplyr::filter with helper functions like

data.split <- function(is_train = TRUE) {
    set.seed(1)
    mutate(x, train = runif(n()) < 0.75) %>%
    filter(train == is_train)
}
vladiim
0

I wrote a function (my first one, so it might not work well) to make this go faster if I'm working with multiple data tables and don't want to repeat the code.

xtrain <- function(data, proportion, t1, t2){
  data <- data %>% rowid_to_column("rowid")
  train <- slice_sample(data, prop = proportion)
  assign(t1, train, envir = .GlobalEnv)
  test <- data %>% anti_join(as.data.frame(train), by = "rowid")
  assign(t2, test, envir = .GlobalEnv)
}

xtrain(iris, .80, 'train_set', 'test_set')

You'll need to have dplyr and tibble loaded. This takes a given dataset, the proportion you want to use for sampling, and two object names. The function creates the tables and then assigns them as objects in your global environment.

0

Try using:

idx <- sample(2, nrow(data), replace = TRUE, prob = c(0.75, 0.25))

and then use the resulting ids to access the split data:

training <- data[idx == 1, ]
testing  <- data[idx == 2, ]

Shreyas Shrawage
-2

There is a very simple way to select a number of rows using the R index for rows and columns. This lets you CLEANLY split the data set given a number of rows - say the 1st 80% of your data.

In R, all rows and columns are indexed, so DataSetName[1,1] is the value in the first row and first column of "DataSetName". I can select rows using [x, ] and columns using [, x].

For example: If I have a data set conveniently named "data" with 100 rows I can view the first 80 rows using

View(data[1:80,])

In the same way I can select these rows and subset them using:

train = data[1:80,]

test = data[81:100,]

Now I have my data split into two parts without the possibility of resampling. Quick and easy.

Community
  • Although it is true that data can be split that way, it is not advised. Some datasets are ordered by a variable that you are not aware of, so it's best to sample which rows will be considered as training instead of taking the first n rows. – user5029763 Sep 06 '18 at 20:11
  • If you shuffle the data before separating it into test and training sets, your suggestion works. – Hadij Mar 14 '19 at 16:51