0

I am trying to use a loop function to reduce the length of a data set. I am trying to sample equally from each of the four subgroups within my data frame (all equal length). I am having trouble coming up with code that will be able to sample n-1 rows from each subgroup, where n represents the current length of the subgroup. My current code is as follows:

sub.df<- function(x){
  library(data.table)
  library(tidyverse)
  setDT(x)
  while(nrow(x) > 24) { 
    x.1 <- x %>% # this is the beginning of the sample part
      group_by(x$spiral) %>% 
      tally() %>% select(-n) %>%
      sample_n(x, nrow(x)-1, replace = FALSE) #this is where I have trouble
    ks <- ks.test(dist(x[,c(1,2)]), unif.null) #this part is for evaluating the exclusions
    ks.1 <- ks.test(dist(x.1[,c(1,2)]), unif.null)
    if(ks.1$statistic > ks$statistic) {x <- x.1} else {x <- x}
  }

}

An example of the data:

x.cord  y.cord  subgroup
1       1       1
1       4       1
3       5       1
2       1       1
2       -3      2
3       -1      2
3       -2      2
1       -3      2
-2      -2      3
-4      -1      3
-5      -5      3
-2      -1      3
-3      4       4
-1      1       4
-2      5       4
-4      3       4

Now, if the loop ran correctly, the first instance would sample 3 (4-1) from each subgroup, then 2 (3-1), then 1 (2-1). So my final data would be something like:

x.cord   y.cord   subgroup
3        5        1
1        -3       2
-5       -5       3
-4       3        4

Based on my provided code my actual data set would have 24 points, 6 from each subgroup, but this should hopefully illustrate what I am trying to do.

Isaac
  • 59
  • 1
  • 6
  • 2
    I suggest you provide [sample data and desired output](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – r2evans Jun 09 '18 at 05:33

2 Answers2

3

At a high level I know that I want to use group_by() and filter()

group_by(x, subgroup) %>% filter(predicate_n_minus_1(subgroup))

So the challenge is to write and test predicate_n_minus_1(). I came up with

predicate_n_minus_1 <- function(x)
    seq_along(x) %in% sample(length(x) - 1)

This is easy to test, including the important edge case of a zero- and one-length subgroup

library(testthat)
expect_equal(predicate_n_minus_1(integer()), logical())        # length 0
expect_equal(predicate_n_minus_1(integer(1)), FALSE)           # length 1
expect_equal(length(predicate_n_minus_1(integer(5))), 5)       # length isomorphism
expect_equal(sum(predicate_n_minus_1(integer(5))), 4)          # n - 1
expect_equal(sum(predicate_n_minus_1(letters)), length(letters) - 1) # other types!

I know that this isn't a pure tidyverse solution, but it seems so much cleaner, more easily tested, and more easily modified than nested function calls in MKR's answer. Maybe there's a tidyverse solution that similarly separates the overall data manipulation from the filter specification?

Martin Morgan
  • 45,935
  • 7
  • 84
  • 112
0

In my opinion you are not using sample_n correctly. The function group_size can help you to find the size of the group. Assuming, all group are of same size, you can replace your select statement in function as below.

Lets. first demonstrate, how this sub-sampling will work. OP can use it as part of function once verified.

The use of min(group_size(group_by(.,subgroup)))-1 will ensure that 1 less than group with with least rows will be sampled.

library(tidyverse)
x %>% # this is the beginning of the sample part
  group_by(subgroup) %>%  # This will ensure that equal selection from each group
  sample_n(.,min(group_size(group_by(.,subgroup)))-1, replace = FALSE)

#Result - 3 from each subgroup has been selected. 

# # A tibble: 12 x 3
# # Groups: subgroup [4]
# x.cord y.cord subgroup
# <int>  <int>    <int>
# 1      1      1        1
# 2      3      5        1
# 3      2      1        1
# 4      2     -3        2
# 5      3     -1        2
# 6      1     -3        2
# 7     -4     -1        3
# 8     -2     -1        3
# 9     -5     -5        3
# 10     -4      3        4
# 11     -2      5        4
# 12     -3      4        4

Now, since verification has be been done above, lets modify function.

Note: Function is not tested. Request OP to test with real data.

# modified function should be as
sub.df<- function(x){
  library(tidyverse)
  while(nrow(x) > 24) { 
    x.1 <- x %>% # this is the beginning of the sample part
      group_by(spiral) %>% 
      sample_n(.,min(group_size(group_by(.,spiral)))-1, replace = FALSE)
    ks <- ks.test(dist(x[,c(1,2)]), unif.null) #this part is for evaluating the exclusions
    ks.1 <- ks.test(dist(x.1[,c(1,2)]), unif.null)
    if(ks.1$statistic > ks$statistic) {x <- x.1} else {x <- x}
  }
  x
}

Data:

x <- read.table(text =
"x.cord  y.cord  subgroup
1       1       1
1       4       1
3       5       1
2       1       1
2       -3      2
3       -1      2
3       -2      2
1       -3      2
-2      -2      3
-4      -1      3
-5      -5      3
-2      -1      3
-3      4       4
-1      1       4
-2      5       4
-4      3       4",
header = TRUE)
MKR
  • 19,739
  • 4
  • 23
  • 33