
I have a question about getting the leftover rows after using the sample function. For school, we had to create a test data frame and a train data frame. The data that I have to validate comes with only a train data frame. The raw data frame has 2158 observations, and a train data frame with 1529 observations was created from it:

set.seed(22)
train <- Gary[sample(1:nrow(Gary), 1529,
                 replace=FALSE),]

train[, 1] <- as.factor(unlist(train[, 1]))
train[, 2:201] <- as.numeric(as.factor(unlist(train[, 2:201])))    

Now I want to have the "leftovers" in a different dataframe.

Does anyone know how to do this?

Uwe

2 Answers


You can use negative indexing in R if you know the training indices. So we only need to rewrite your first lines:

set.seed(22)
train_indices <- sample(1:nrow(Gary), 1529, replace=FALSE)
train <- Gary[train_indices, ]
test <- Gary[-train_indices, ]
# Proceed with rest of script.
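
A small self-contained sketch of the same idea on made-up data may help; the data frame `df` and its columns are hypothetical and only stand in for `Gary`. It also illustrates one caveat: if the index vector happens to be empty, `-integer(0)` is still `integer(0)`, so negative indexing returns zero rows rather than all rows, whereas `setdiff()` handles that case.

```r
# Hypothetical toy data standing in for the OP's data frame
df <- data.frame(id = 1:10, value = letters[1:10])

set.seed(22)
train_indices <- sample(seq_len(nrow(df)), 7, replace = FALSE)
train <- df[train_indices, ]
test  <- df[-train_indices, ]   # every row whose index is not in train_indices

# Sanity checks: the sizes add up and the two parts do not overlap
nrow(train) + nrow(test) == nrow(df)
length(intersect(train$id, test$id)) == 0

# Caveat: an empty index vector selects nothing instead of everything
empty <- integer(0)
n_negative <- nrow(df[-empty, ])                            # 0 rows
n_setdiff  <- nrow(df[setdiff(seq_len(nrow(df)), empty), ]) # all 10 rows
```

With a fixed, non-zero sample size such as 1529 this caveat never triggers, so plain negative indexing is sufficient for the question as asked.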
AlexR

This can be done using the setdiff() function.

Edit: Please note that there is another answer by @AlexR using negative indexing which is much simpler if the indices are only used for subsetting.

However, first we need to create some dummy data, as the OP hasn't provided any data with the question (for future reference, please read "How to make a great R reproducible example?"):

Dummy data

Create dummy data frame with 2158 rows and two columns:

n <- 2158
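# stringsAsFactors = TRUE reproduces the factor column shown below on R >= 4.0
Gary <- data.frame(V1 = seq_len(n), V2 = sample(LETTERS, n, replace = TRUE),
                   stringsAsFactors = TRUE)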
str(Gary)
#'data.frame':  2158 obs. of  2 variables:
# $ V1: int  1 2 3 4 5 6 7 8 9 10 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 21 11 24 10 5 17 18 1 25 7 ...

Sampled and leftover rows

First, the vectors of sampled and leftover rows are computed, before subsetting Gary in subsequent steps:

set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace=FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), sampled_rows)

train <- Gary[sampled_rows, ]
leftover <- Gary[leftover_rows, ]

str(train)
#'data.frame':  1529 obs. of  2 variables:
# $ V1: int  657 1025 2143 1123 1817 1558 1324 1590 898 801 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 19 16 25 15 2 5 8 14 20 3 ...
str(leftover)
#'data.frame':  629 obs. of  2 variables:
# $ V1: int  2 5 6 7 8 9 10 12 20 24 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 11 5 17 18 1 25 7 25 7 18 ...

leftover is a data frame which contains the rows of Gary which haven't been sampled.

Verification

To verify, we combine train and leftover again and sort the rows to compare with the original data frame:

recombined <- rbind(train, leftover)
identical(Gary, recombined[order(recombined$V1), ])
#[1] TRUE
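
The check above relies on `V1` holding the original row order. As a complementary sketch that needs no such column, the index vectors themselves can be checked: they should be disjoint and together cover every row exactly once. The row count `n` is taken from the question; no data frame is required for this part.

```r
n <- 2158
set.seed(22)
sampled_rows  <- sample(seq_len(n), 1529, replace = FALSE)
leftover_rows <- setdiff(seq_len(n), sampled_rows)

# 2158 - 1529 rows are left over
n_leftover <- length(leftover_rows)

# No index appears in both vectors
n_overlap <- length(intersect(sampled_rows, leftover_rows))

# Together the two vectors are exactly the row numbers 1..n
covers_all <- identical(sort(c(sampled_rows, leftover_rows)), seq_len(n))
```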
Uwe
  • Better use negative indexing! `Gary[-sampled_rows,]`. No need for anything like `setdiff` or the like. – AlexR Jan 21 '17 at 15:43
  • @AlexR, you're right! Didn't think about it. Negative indexing is much simpler if the indices are only used for subsetting. – Uwe Jan 21 '17 at 16:08
  • AlexR's answer works!!! Thank you all – Niek Bezuijen Jan 23 '17 at 09:31
  • @NiekBezuijen Appreciate your feedback. Excellent that AlexR's answer works for you (and I learned something as well). As you are new to Stack Overflow, I kindly suggest reading [this](http://stackoverflow.com/help/accepted-answer) and maybe [this](http://stackoverflow.com/help/whats-reputation). – Uwe Jan 23 '17 at 09:48