I have a dataset, which looks like:

y  Age   Height
0  Aage  Aheight
1  Bage  Bheight

Every variable has at least two categories. When I read the dataset with this code:

DM_input = read.csv(file="C:/Users/user/Desktop/test.CSV",header = TRUE, sep = ",")

R correctly reports 5040 observations of 11 variables. When I try to split the dataset into training and test sets with the following code:

> train <- DM_input[DM_input$rand <= 0.7, c(2,3,4,5,6,7,8,9,10)]
> test <- DM_input[DM_input$rand > 0.7, c(2,3,4,5,6,7,8,9,10)]

I get 0 observations of 11 variables, and both tables are empty. I do not understand why this is happening. I removed special characters, but it did not help. Thanks.

  • Could you use `dput(DM_input[1:10,])` to show us the first few rows of your data? If we can't see what's in `DM_input`, we aren't going to be able to help you. – user2554330 Aug 02 '20 at 09:05
  • What does `range(DM_input$rand)` return? Also `c(2,3,4,5,6,7,8,9,10)` can be written as `2:10`. – Ronak Shah Aug 02 '20 at 10:05
  • Warning messages: 1: In min(x, na.rm = na.rm) : no non-missing arguments to min; returning Inf 2: In max(x, na.rm = na.rm) : no non-missing arguments to max; returning -Inf – Mayya Lihovodov Aug 03 '20 at 09:07
  • @user2554330 this query returns perfectly all the data with proper reading each var as factor – Mayya Lihovodov Aug 03 '20 at 09:07
  • The warning message you are seeing means that `DM_input$rand` is length 0. Are you sure you have a column with that exact name? Remember that R is case-sensitive, so `DM_input$Rand` would be different. – user2554330 Aug 03 '20 at 11:44

1 Answer

I think `sample.int` could help you split the dataset.

Here is an example:

data(iris)

# number of rows of dataset
size_iris <- nrow(iris)

# set the proportion of sample split to 0.7
size_sample <- floor(0.7*size_iris)

# set a reproducible random result
set.seed(2020)

# sample the dataset
mysample <- sample.int(n = size_iris, size = size_sample, replace = FALSE)
train <- iris[mysample,]
test <- iris[-mysample,]

# check the sizes
size_iris    # [1] 150
nrow(train)  # [1] 105
nrow(test)   # [1] 45
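If you would rather keep the threshold approach from the question, here is a minimal sketch. It assumes the empty result comes from `DM_input` having no column named exactly `rand`, which is what the warning in the comments suggests but which I cannot confirm without seeing the data. The small data frame below is a hypothetical stand-in for your CSV, not your real data.

```r
# Hypothetical stand-in for DM_input (the real CSV is not available here)
set.seed(2020)
DM_input <- data.frame(
  y      = factor(sample(0:1, 20, replace = TRUE)),
  Age    = factor(sample(c("Aage", "Bage"), 20, replace = TRUE)),
  Height = factor(sample(c("Aheight", "Bheight"), 20, replace = TRUE))
)

# A missing column is returned as NULL, so the comparison yields a
# zero-length logical index and the subset has 0 rows
"rand" %in% names(DM_input)
nrow(DM_input[DM_input$rand <= 0.7, ])

# Create the uniform random column explicitly, then split on it
DM_input$rand <- runif(nrow(DM_input))
keep_cols <- names(DM_input) != "rand"   # drop the helper column from the output
train <- DM_input[DM_input$rand <= 0.7, keep_cols]
test  <- DM_input[DM_input$rand >  0.7, keep_cols]

# Every row ends up in exactly one of the two sets
nrow(train) + nrow(test) == nrow(DM_input)
```

Note that a threshold on `runif` gives only an approximate 70/30 split, while the `sample.int` version above gives exactly `floor(0.7 * n)` training rows.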

There is a similar question here with a lot of good answers: How to split data into training/testing sets using sample function