I am trying to separate my data into test and training sets. I use the sample function, and when I try to set the training set I write:
training<-data[selected, ]
and the error I get is that there is an incorrect number of dimensions. What should I do?
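The "incorrect number of dimensions" error usually means that the object being subset has no rows and columns, for example a plain vector, so the two-index form data[selected, ] cannot be applied to it. The question does not show how data was created, so the following is only a minimal sketch of that situation: data_vec is a hypothetical stand-in for the asker's object, not the actual data.

```r
# Hypothetical stand-in for the asker's object: a plain numeric vector
data_vec <- runif(10)
selected <- sample(length(data_vec), 6)

# A vector has no dim attribute, so there are no rows or columns to index
is.null(dim(data_vec))

# Two-index subsetting on a vector raises an error; capture its message
result <- tryCatch(data_vec[selected, ],
                   error = function(e) conditionMessage(e))
print(result)

# One-index subsetting is the form that works for a vector
training_vec <- data_vec[selected]
length(training_vec)
```

Checking dim(data) or str(data) before subsetting shows whether the object really is a data frame or matrix.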
Here is an example where we create some random data and take a 60% sample to use as the training data. First, we create a data frame containing 1,000 rows and 10 columns of random data, along with an id column that represents the row number of each observation within the data frame.
set.seed(95014) # set seed to make sample reproducible
data <- data.frame(id=1:1000,matrix(runif(10000,max=100),nrow=1000))
Next, we sample 600 ids from the id vector. By default, sample() draws without replacement, so no id is selected twice.
aSample <- sample(data$id,600)
head(aSample)
...and the output:
> head(aSample)
[1] 559 570 118 121 934 49
Finally, we use the id values to split the data into training and testing data frames.
training <- data[data$id %in% aSample,]
nrow(training)
head(training)
testing <- data[!(data$id %in% aSample),]
nrow(testing)
head(testing)
...and the output:
> training <- data[data$id %in% aSample,]
> nrow(training)
[1] 600
> head(training)
id X1 X2 X3 X4 X5 X6 X7 X8
1 1 93.79708 55.77981 47.321792 58.25367 73.40112 22.78027 31.21750 81.908545
2 2 21.64439 98.72992 81.351044 31.92606 36.06994 28.52702 88.30162 9.474531
4 4 42.61110 95.83472 3.142368 94.38545 84.18761 47.67777 16.98694 24.277995
7 7 44.54569 59.44912 9.658479 62.29967 18.64361 41.94804 72.89537 41.777294
8 8 94.62813 84.44770 37.480883 85.66058 88.25963 9.99134 89.65660 86.425941
9 9 12.25220 70.70493 92.889167 74.90797 56.46179 18.65665 86.23158 76.616870
X9 X10
1 23.179851 64.64663
2 53.568108 16.69363
4 1.229717 41.45356
7 2.258932 10.91128
8 25.547644 98.37873
9 83.552602 46.69852
> testing <- data[!(data$id %in% aSample),]
> nrow(testing)
[1] 400
> head(testing)
id X1 X2 X3 X4 X5 X6 X7
3 3 20.89891 12.98920 80.051161 92.98576 66.56050 67.77384 84.82885
5 5 32.13445 83.12521 47.775644 38.55591 49.14070 69.26557 69.02516
6 6 39.76976 25.96758 6.683530 98.92120 22.67881 43.15225 89.78034
10 10 16.92903 99.77142 51.578957 68.83097 74.68267 21.93792 80.45868
11 11 84.66744 39.72422 8.587481 17.10894 71.81957 92.79043 87.10920
12 12 93.03164 34.98200 18.010040 90.85953 96.07546 60.30213 32.97798
X8 X9 X10
3 68.375392 24.35218 37.49941
5 48.904674 56.62582 92.65490
6 88.658118 88.50311 46.35529
10 50.494296 51.46921 74.25503
11 33.429481 14.93083 48.94056
12 8.279923 22.67349 41.68959
>
Notice that none of the id values in training appear in testing. We can verify this with the following code.
# verify no ids in training are in testing
sum(testing$id %in% training$id)
...and the output:
> # verify no ids in training are in testing
> sum(testing$id %in% training$id)
[1] 0
>
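The same split can also be written with row positions, which is closer to the data[selected, ] form used in the question. This is a minimal sketch that assumes data is a data frame; it rebuilds the toy data from above, and selected is simply the asker's variable name reused for illustration.

```r
set.seed(95014)  # set seed to make sample reproducible
data <- data.frame(id = 1:1000, matrix(runif(10000, max = 100), nrow = 1000))

# sample 600 row positions (not id values) without replacement
selected <- sample(nrow(data), 600)

training <- data[selected, ]   # rows at the sampled positions
testing  <- data[-selected, ]  # every row that was not sampled

nrow(training)  # 600
nrow(testing)   # 400

# no id should appear in both data frames
length(intersect(training$id, testing$id))  # 0
```

Negative indexing with -selected only works with row positions. With id values or other keys, the %in% approach shown earlier is the safer choice.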
Note that if we want a random sample that uses the values of a dependent variable as the split criterion for testing vs. training data, we can use caret::createDataPartition(). This approach ensures that the distribution of values for the dependent variable is roughly equivalent across the training and testing data frames.
We'll use the Sonar data frame from the mlbench package to illustrate this.
library(mlbench)
data(Sonar)
library(caret)
set.seed(95014)
inTraining <- createDataPartition(Sonar$Class,p = .75,list = FALSE)
training <- Sonar[inTraining,]
testing <- Sonar[-inTraining,]
table(training$Class)
table(testing$Class)
...and the output:
> table(training$Class)
M R
84 73
> table(testing$Class)
M R
27 24
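To make "roughly equivalent" concrete, the class counts can be converted to proportions. The sketch below simply re-enters the counts printed above as named vectors (training_counts and testing_counts are names chosen here for illustration), so it runs without mlbench or caret.

```r
# class counts copied from the table() output above
training_counts <- c(M = 84, R = 73)
testing_counts  <- c(M = 27, R = 24)

# convert counts to within-set proportions
training_props <- round(prop.table(training_counts), 3)
testing_props  <- round(prop.table(testing_counts), 3)

training_props  # M 0.535, R 0.465
testing_props   # M 0.529, R 0.471
```

The share of class M differs by less than one percentage point between the two data frames, which is the balance the partitioning is meant to provide.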