
https://www.dropbox.com/s/35w66sri5rauv5d/FlightDelays.csv?dl=0

I am reading a dataset from the link above; it contains 2201 rows. Using the sample.split function from caTools, I set the split ratio to 0.6, so I should get two datasets containing 1320 and 881 rows respectively. Initially this worked fine, but now the split comes out at about 0.53, even though I still specify 0.6 as the ratio. What might cause this sudden change, and how can I resolve it? The code is given below.

library(caTools)
originaldata.df<-read.csv("use csv from the link given above")
split<-sample.split(originaldata.df,SplitRatio = 0.6)
Trainingdataset<-subset(originaldata.df,split == "TRUE")
Testingdataset<-subset(originaldata.df,split == "FALSE")

Expected output:
1320 (2201 * 60 / 100)
881 (2201 * 40 / 100)
Actual output:
1186
1015
Srujan K.N.
  • I am too tired to figure out why, but I realized `caTools` samples on the columns instead of rows. – M-- Jun 06 '17 at 03:39
  • Possible duplicate of [How to split data into training/testing sets using sample function in R program](https://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function-in-r-program) – Ronak Shah Jun 06 '17 at 04:05

2 Answers

Base-R:

You can sample row indices according to the split ratio:

# floor() makes the training size an explicit whole number (1320 of 2201 rows)
indexes <- sample(seq_len(nrow(originaldata.df)),
                  size = floor(0.6 * nrow(originaldata.df)))

Trainingdataset <- originaldata.df[indexes,]
Testingdataset <- originaldata.df[-indexes,]

This would be the output:

> dim(Testingdataset)
# [1] 881  13
> dim(Trainingdataset)
# [1] 1320   13
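If the same split needs to be reproduced later, the index approach can be combined with a fixed seed. The following is only a sketch: the data frame `df`, its columns, and the seed value are hypothetical stand-ins for the CSV from the question, chosen so that the row count matches (2201).

```r
# Hypothetical stand-in for the CSV: 2201 rows, two made-up columns
df <- data.frame(id = seq_len(2201), value = seq_len(2201) %% 7)

set.seed(123)  # arbitrary seed, only so the same rows are drawn on every run

n_train   <- floor(0.6 * nrow(df))               # 1320
train_idx <- sample(seq_len(nrow(df)), size = n_train)

train <- df[train_idx, ]   # rows that were drawn
test  <- df[-train_idx, ]  # every row that was not drawn

nrow(train)  # 1320
nrow(test)   # 881
```

Because the test set is defined as "everything not drawn", the two sets cannot share a row and together always add up to the full data.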

caTools package:

library(caTools)
# Pass a single column (a vector with one entry per row). Given the whole
# data.frame, sample.split samples over the columns instead of the rows.
split <- sample.split(originaldata.df$schedtime, SplitRatio = 0.6)

Trainingdataset <- subset(originaldata.df, split == TRUE)
Testingdataset <- subset(originaldata.df, split == FALSE)

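The subset sizes are close to, but not exactly, what you expect, because sample.split tries to preserve the relative ratios of the distinct values in the vector it is given, so the split is only approximately 60/40: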

> dim(Trainingdataset)
# [1] 1323   13
> dim(Testingdataset)
# [1] 878  13
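As for where the 0.53 in the question may come from: a data frame is a list of columns, so its `length()` is the number of columns (13 here), and `sample.split` returns a logical vector as long as its input. A 13-element mask used to index 2201 rows is silently recycled. The sketch below uses base R only; the mask is a hypothetical example with 7 of 13 values `TRUE`, picked because it is consistent with the counts reported in the question (7/13 ≈ 0.54), not the actual vector caTools produced.

```r
# Hypothetical stand-in with the same shape as the question's data: 2201 x 13
df <- data.frame(matrix(0, nrow = 2201, ncol = 13))
length(df)  # 13: length() of a data frame counts columns, not rows

# Hypothetical 13-element mask with 7 TRUE values (an assumption for illustration)
mask <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE,
          FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Row indexing recycles the short mask across all 2201 rows
n_true  <- nrow(df[mask, ])
n_false <- nrow(df[!mask, ])

n_true             # 1186
n_false            # 1015
n_true / nrow(df)  # about 0.54 rather than 0.6
```

Which columns happen to be `TRUE` changes from run to run unless a seed is set, which would also explain why the behaviour looked as if it had changed suddenly.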
M--

Here's a customised split function that derives two subsets of row numbers based on the given proportion:

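splitFactor <- function(rows, prop){
  # Draw the first subset at random, then assign every remaining row to the
  # second subset so that the two subsets never overlap
  a <- sample(seq_len(rows), ceiling(rows*prop))
  b <- setdiff(seq_len(rows), a)
  list(sort(a), b)
}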


sp.53 <- splitFactor(nrow(iris), .53)
lapply(sp.53, length)

# [[1]]
# [1] 80

# [[2]]
# [1] 70

To derive the training and test sets with the function:

all.sets <- lapply(splitFactor(nrow(iris), .6),
                   function(x) iris[x,])

lapply(all.sets, dim)

# [[1]]
# [1] 90  5

# [[2]]
# [1] 60  5
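Whatever helper is used, it is worth checking that the two index sets do not overlap and together cover every row; drawing two independent samples would generally not guarantee this. A minimal sketch of such a check follows, where `split_rows` is a hypothetical helper written for this illustration and the seed is arbitrary:

```r
# Hypothetical helper: the test indices are defined as the complement of the
# training indices, so the two sets are disjoint by construction
split_rows <- function(n, prop) {
  train <- sort(sample(seq_len(n), ceiling(n * prop)))
  test  <- setdiff(seq_len(n), train)
  list(train = train, test = test)
}

set.seed(1)  # arbitrary seed for repeatability
idx <- split_rows(nrow(iris), 0.6)

overlap <- length(intersect(idx$train, idx$test))
covers  <- setequal(c(idx$train, idx$test), seq_len(nrow(iris)))

overlap  # 0
covers   # TRUE
```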
Adam Quek