Divide data set into 60%, 20%, 20%

Question

I am trying to split my data into 3 sets instead of 2, as described in the title. This is the script I used:

set.seed(125)
d <- sample(x = nrow(db), size = nrow(db) * 0.60)
train60 <- db[d, ]
valid40 <- db[-d, ]

Is there a way to modify the above script? I have tried to create another line:

valid40 <- db[-d] * 0.2 which did not work.

Current dataset has several factor variables.

I have tried Frank's solution here, which uses the cut function, but I get:

Error in cut.default(seq(nrow(df)), nrow(df) * cumsum(c(0, spec)), labels = names(spec)) : lengths of 'breaks' and 'labels' differ

which I don't understand even after searching for help online.

Sample 50 % of valid40? .. sample(valid40,nrow(valid40)/2). – r.user.05apr May 23 '17 at 09:41
Also perhaps see `modelr::crossv_mc`. – Axeman May 23 '17 at 09:43

score 5 · Accepted Answer · answered May 23 '17 at 09:49

If I understood you correctly, you want to split the sample into 60%, 20% and 20% parts, with no row appearing in more than one part. I use the iris data as an example; it has 150 rows and 5 columns.

samp <- sample(seq_len(nrow(iris)), 0.6 * nrow(iris)) ## Row indices for the 60/40 split

train60 <- iris[samp, ] ## This is the 60% chunk
remain40 <- iris[-samp, ] ## The remaining 40%, which is split again below

samp2 <- sample(seq_len(nrow(remain40)), 0.5 * nrow(remain40))

first20 <- remain40[samp2, ] ## First chunk of 20%
secnd20 <- remain40[-samp2, ] ## Second chunk of 20%

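## Check for overlap by row name: 0 means no row appears in more than one chunk.
## (Base intersect() on data frames compares columns rather than rows, so row names are compared instead.)
anyDuplicated(c(rownames(train60), rownames(first20), rownames(secnd20)))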
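As an alternative to sampling twice, the same 60/20/20 split can be taken from a single random permutation of the row indices. This is only a minimal sketch: the names `idx`, `n_train`, `n_valid`, `valid20` and `test20` are illustrative and not part of the original answer, and `set.seed(125)` is there only to make the result reproducible.

```r
set.seed(125)  # assumption: fixed seed, only for reproducibility

n   <- nrow(iris)
idx <- sample(n)  # random permutation of 1..n, so no index repeats

# Chunk sizes; round() guards against floating-point noise in 0.6 * n
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

# Slice the permutation into three consecutive, non-overlapping pieces
train60 <- iris[idx[seq_len(n_train)], ]
valid20 <- iris[idx[n_train + seq_len(n_valid)], ]
test20  <- iris[idx[(n_train + n_valid + 1):n], ]

# Sizes of the three parts
c(nrow(train60), nrow(valid20), nrow(test20))
```

Because every row index occurs exactly once in `idx`, the three parts cannot overlap, and the last part simply receives whatever rows remain.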

PKumar

@halfer: apology about adding 'urgent' to the question. – user149635 May 24 '17 at 02:09
Hi Jeppe & Pkr: thank you so much for answering my call of 'distress' with such detailed explanations, which are very helpful to a newbie. Both scripts work very well! I would like to upvote both of your answers, but I am unsure whether the system will permit it. – user149635 May 24 '17 at 02:15

score 2 · Answer 2 · answered May 23 '17 at 10:02

db <- data.frame(x=1:10, y=11:20)

set.seed(125)
d <- sample(x = nrow(db), size = nrow(db) * 0.60)

train60 <- db[d, ]

valid40 <- db[-d, ]

Now put half of valid40 into each of two new data frames:

e <- sample(x = nrow(valid40), size = nrow(valid40) * 0.50)

train20 <- valid40[e, ]
valid20 <- valid40[-e, ]
