2

I want to divide my data set into train and test data, but one of my columns identifies a group. All members of a group must be in either train or test. For example, if the group column looks like this:

         group
           1
           1
           1
           1
           1
           2
           2
           2
           3
           3

If one row of the first group is in the train set, then all of its first 5 rows must be there too, and so on.

3 Answers

1

A solution using dplyr. dat_train and dat_test are the final results. I assume a case with 10,000 groups in the training dataset and 5,000 groups in the testing dataset.

library(dplyr)

# Set seed for reproducibility
set.seed(12345)

# Create an example data frame with group and data
dat <- tibble(group = rep(1:15000, each = 5),
              data = rnorm(75000))

# Step 1: Create a look up table showing group number
g <- dat %>% distinct(group)

# Step 2: Use sample_n to sample groups for train
g_train <- g %>% sample_n(size = 10000)

# Step 3: Use semi_join and anti_join to split dat into train and test
dat_train <- dat %>% semi_join(g_train, by = "group")
dat_test <- dat %>% anti_join(g_train, by = "group")
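
A quick sanity check can confirm that the split really keeps each group on one side. The sketch below repeats the three steps on a smaller, hypothetical data set (15 groups instead of 15000, which is an assumption made only to keep the example short) and then verifies that no group is shared and no rows are lost.

```r
library(dplyr)

set.seed(12345)

# Hypothetical small example: 15 groups of 5 rows each
dat <- tibble(group = rep(1:15, each = 5),
              data = rnorm(75))

# Same three steps as above, with 10 of the 15 groups going to train
g_train <- dat %>% distinct(group) %>% sample_n(size = 10)
dat_train <- dat %>% semi_join(g_train, by = "group")
dat_test <- dat %>% anti_join(g_train, by = "group")

# No group may appear on both sides of the split
shared_groups <- intersect(dat_train$group, dat_test$group)
stopifnot(length(shared_groups) == 0)

# Every row must land in exactly one of the two sets
stopifnot(nrow(dat_train) + nrow(dat_test) == nrow(dat))

# Number of distinct groups in each set
n_distinct(dat_train$group)
n_distinct(dat_test$group)
```

If either stopifnot() call fails, the join key or the sampled look-up table is probably not what was intended.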
www
0

Let's assume you have a total of 20 groups and you want 8 groups in the training set and the remaining 12 in your test set.

First, let's generate some data to play with:

dat <- data.frame(group=factor(rep(1:20, each=5)), value=rnorm(100))

As you want to sample by group rather than by observation, draw a random sample of 8 groups for your training set and put the remaining groups into the test set.

train.groups <- sample(levels(dat$group), 8)
dat.train <- dat[dat$group %in% train.groups, ]
dat.test <- dat[!(dat$group %in% train.groups), ]
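
Note that levels() only returns values when the group column is a factor. If the real column is a plain integer or character vector, levels() returns NULL and sample() has nothing to draw from, which is one plausible cause of an "invalid first argument" error. A minimal sketch of a variant using unique() instead, assuming a hypothetical data frame whose group column is an integer:

```r
set.seed(42)

# Hypothetical data where group is a plain integer column, not a factor
dat <- data.frame(group = rep(1:20, each = 5), value = rnorm(100))

# levels() returns NULL for a non-factor column
is.null(levels(dat$group))

# unique() lists the group ids for factor, integer, and character columns
train.groups <- sample(unique(dat$group), 8)
dat.train <- dat[dat$group %in% train.groups, ]
dat.test <- dat[!(dat$group %in% train.groups), ]
```

One caveat: if the data contained only a single numeric group id, sample() would treat that number as an upper bound and sample from 1 up to it, so this sketch assumes at least two groups.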
Phil
  • what is each=5? –  Sep 14 '19 at 18:01
  • From ?rep: ‘each’ non-negative integer. Each element of ‘x’ is repeated ‘each’ times. Other inputs will be coerced to an integer or double vector and the first element taken. Treated as ‘1’ if ‘NA’ or invalid. – Phil Sep 14 '19 at 18:07
  • I mean why 5? rep is repeating 20 groups 5 times? why? –  Sep 14 '19 at 18:10
  • Well, to replicate your example, where you also had 5 instances of at least the first group. But it does not really matter for the solution. – Phil Sep 14 '19 at 18:12
  • Error in sample.int(length(x), size, replace, prob) : invalid first argument –  Sep 14 '19 at 20:49
  • `sample` ≠ `sample.int`. They take different parameters. – Phil Sep 20 '19 at 10:11
-1

You could use the dplyr package (part of the tidyverse) to solve this.

Assume your dataset is named df1.

Here is an example:

library(dplyr)
library(tidyverse)

# Use == to test equality; a single = inside filter() raises an error
training_data <- df1 %>% filter(group == 1)

testing_data <- df1 %>% filter(group == 2)
James Martherus