
I'm trying to work out how, in Azure ML (so R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column end up on one side of the split or the other. For example:

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       1234 |    1 | Foo                |    1 |
|       5678 |    0 | Bar                |    1 |
|    9101112 |    1 | Quack              |    1 |
|   13141516 |    1 | Meep               |    1 |
|       1234 |    0 | Boop               |    2 |
|       5678 |    0 | Baa                |    2 |
|    9101112 |    0 | Bleat              |    2 |
|   13141516 |    1 | Maaaa              |    2 |
|       1234 |    0 | Foo                |    3 |
|       5678 |    0 | Bar                |    3 |
|    9101112 |    1 | Quack              |    3 |
|   13141516 |    1 | Meep               |    3 |
|       1234 |    1 | Boop               |    4 |
|       5678 |    1 | Baa                |    4 |
|    9101112 |    0 | Bleat              |    4 |
|   13141516 |    1 | Maaaa              |    4 |
+------------+------+--------------------+------+

If I chose, say, a 50/50 split grouped on the Student ID column, acceptable output would be two new datasets:

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       1234 |    1 | Foo                |    1 |
|       1234 |    0 | Boop               |    2 |
|       1234 |    0 | Foo                |    3 |
|       1234 |    1 | Boop               |    4 |
|    9101112 |    1 | Quack              |    1 |
|    9101112 |    0 | Bleat              |    2 |
|    9101112 |    1 | Quack              |    3 |
|    9101112 |    0 | Bleat              |    4 |
+------------+------+--------------------+------+

and

+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
|       5678 |    0 | Bar                |    1 |
|       5678 |    0 | Baa                |    2 |
|       5678 |    0 | Bar                |    3 |
|       5678 |    1 | Baa                |    4 |
|   13141516 |    1 | Meep               |    1 |
|   13141516 |    1 | Maaaa              |    2 |
|   13141516 |    1 | Meep               |    3 |
|   13141516 |    1 | Maaaa              |    4 |
+------------+------+--------------------+------+

Now, from what I can tell, this is basically the opposite of a stratified split, which would take a random sample with every student represented on both sides.

I would prefer an Azure ML function that did this, but I think that's unlikely. Is there an R function or library that provides this kind of functionality? All I could find were questions about stratification, which obviously don't help me much.

Jeff
  • Just `sample` the `unique` student IDs and subset rows with `%in%`. – alistaire Jun 28 '17 at 01:01
  • @alistaire that makes sense, and I feel pretty dumb for not thinking of it :/ if you feel like adding that as an answer I'll accept it – Jeff Jun 28 '17 at 02:04

1 Answer

You can use the following command (it needs dplyr):

library(dplyr)
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(`Student ID`)))[match(`Student ID`, unique(`Student ID`))])

It returns the original data frame with a new column that indicates the fold each student is in. If you want more folds, just adjust the '1:2' part.
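As a rough sketch of how the fold column might then be used, the snippet below builds a cut-down copy of the question's table, assigns three folds instead of two, and splits the result into one data frame per fold. The variable `k`, the seed, and the shortened data are assumptions for illustration only, and the fold is looked up by each student's position among the unique IDs (via `match`) rather than by the ID value itself.

```r
library(dplyr)

# Cut-down copy of the question's table (two weeks only); check.names = FALSE
# keeps the space in the "Student ID" column name.
df <- data.frame(
  `Student ID` = rep(c(1234, 5678, 9101112, 13141516), times = 2),
  pass = c(1, 0, 1, 1, 0, 0, 0, 1),
  week = rep(1:2, each = 4),
  check.names = FALSE
)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

k <- 3  # hypothetical number of folds

# Draw one fold label per distinct student, then map it back onto every row
# of that student through the student's position in unique(`Student ID`).
data.fold <- mutate(
  df,
  fold = sample(rep_len(1:k, n_distinct(`Student ID`)))[
    match(`Student ID`, unique(`Student ID`))
  ]
)

# One data frame per fold; every student's rows land in exactly one of them.
folds <- split(data.fold, data.fold$fold)

# Sanity check: each student is assigned to a single fold.
folds_per_student <- tapply(
  data.fold$fold, data.fold$`Student ID`,
  function(f) length(unique(f))
)
stopifnot(all(folds_per_student == 1))
```

With four students and three folds, one fold necessarily holds two students, so fold sizes are only approximately equal.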

I've tried the 'sample unique' way but it did not always work for me in the past.
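For comparison, here is a minimal base-R sketch of the 'sample unique' approach suggested in the comments, assuming a 50/50 split on a cut-down copy of the question's table. The names `train`, `test`, and `train_ids`, and the seed, are hypothetical. It samples positions in the ID vector rather than the IDs themselves, because `sample(x)` treats a single number `x` as `1:x`.

```r
# Cut-down copy of the question's table (two weeks only).
df <- data.frame(
  `Student ID` = rep(c(1234, 5678, 9101112, 13141516), times = 2),
  pass = c(1, 0, 1, 1, 0, 0, 0, 1),
  week = rep(1:2, each = 4),
  check.names = FALSE
)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Distinct students, then a random half of them for the first dataset.
ids <- unique(df$`Student ID`)
train_ids <- ids[sample(length(ids), size = floor(0.5 * length(ids)))]

# Rows follow their student: all of a student's records go to the same side.
in_train <- df$`Student ID` %in% train_ids
train <- df[in_train, ]
test <- df[!in_train, ]

# Sanity check: no student appears on both sides.
stopifnot(length(intersect(train$`Student ID`, test$`Student ID`)) == 0)
```

The split proportion applies to students, not rows, so the two datasets have equal row counts only when every student has the same number of records.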

Lukas Heeren