The question says:

Load the data and split it into 75% training and 25% validation data using set.seed(4650).

This is what I have:

setwd("C:/Users/Downloads")
cat = read.csv("cat.csv")
set.seed(4650)
train = sample(c(TRUE, TRUE, TRUE, FALSE), nrow(cat), rep = TRUE)
validation = (!train)

I also need to provide a summary of the training data.

summary(train)

which gives me

Mode       FALSE   TRUE
logical    830     2463

Am I splitting the data in the right way?

Thank you very much.

JungleDiff

2 Answers

This is how data splitting is done in Max Kuhn's book on the caret package.

library(caret)
set.seed(4650)
trainIndex <- createDataPartition(iris$Species, 
                                  p = .75, 
                                  list = FALSE, 
                                  times = 1)

irisTrain <- iris[ trainIndex,]
irisTest  <- iris[-trainIndex,]
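
For a factor outcome, createDataPartition samples within each class, so the class proportions stay roughly the same in both sets. If caret is not available, a rough base-R sketch of the same idea might look like the following. This is not caret's actual implementation, and the names by_class, train_idx, iris_train and iris_valid are only illustrative.

```r
set.seed(4650)

# Row numbers grouped by class
by_class <- split(seq_len(nrow(iris)), iris$Species)

# Draw about 75% of the rows within each class.
# Assumes every class has more than one row: sample() treats a single
# number x as 1:x, which would give wrong results for one-row classes.
train_idx <- unlist(lapply(by_class, function(rows) {
  sample(rows, round(length(rows) * 0.75))
}), use.names = FALSE)

iris_train <- iris[train_idx, ]
iris_valid <- iris[-train_idx, ]

table(iris_train$Species)  # equal counts per class, since iris is balanced
```

Stratifying matters most when some classes are rare; with a plain random split a small class can end up almost entirely in one of the two sets.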
tyluRp

Here's what you can do.

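# Example data
df <- iris

set.seed(4650)  # fix the seed so the split is reproducible

# Number of training rows (75% of the data)
n_train <- round(nrow(df) * 0.75)

# Row indices for each set
train <- sample(seq_len(nrow(df)), n_train, replace = FALSE)
test <- seq_len(nrow(df))[-train]

train_df <- df[train, ]
test_df <- df[test, ] # same as df[-train, ]

summary(train_df)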
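
To compare this with the logical-mask approach in the question, here is a small sketch on iris (mask, idx, train_df and valid_df are illustrative names, and iris stands in for the cat data). Drawing TRUE/FALSE per row gives a training share that is only about 75% and changes with the seed, while drawing a fixed number of row indices fixes the training size in advance. Note also that summary() on the mask only counts TRUE and FALSE; to summarise the training data, subset the data frame first.

```r
df <- iris
n <- nrow(df)

# Approach from the question: each row is TRUE with probability 3/4,
# so the training size is random and only close to 75% on average.
set.seed(4650)
mask <- sample(c(TRUE, TRUE, TRUE, FALSE), n, replace = TRUE)
sum(mask) / n            # roughly 0.75, not exact
summary(df[mask, ])      # summary of the training rows, not of the mask

# Fixed-size approach: draw exactly round(0.75 * n) distinct row numbers.
set.seed(4650)
idx <- sample(seq_len(n), round(n * 0.75))
train_df <- df[idx, ]
valid_df <- df[-idx, ]

nrow(train_df)           # equals round(0.75 * n) by construction
summary(train_df)
```

Both approaches give a valid random split; the fixed-size version is the safer choice when the 75/25 ratio has to hold exactly.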
kangaroo_cliff
  • I want to fit an auto.arima model to multiple time series, using 1 year of data, then 3, 5, 7, ... years (in two-year steps) from each series, and test it on the testing set. How do I do the subsetting so that the fitted model uses the data I want? I would appreciate your help. – Stackuser Apr 09 '20 at 04:10