1

I have a data.frame which looks like this:

Patch, Xmeters,  Ymeters,    Genome
2,    6050, 3954.850,   4 2  0 2  1 0  0 0  2 1  2 1  1 0  2 2  2 0  1 0 
2,    5900, 6293.118,   4 0  2 0  1 0  0 0  0 0  0 3  0 4  2 0  1 1  1 2 
1,    4550, 4301.260,   3 3  1 3  0 1  0 0  0 0  0 3  2 2  0 0  0 0  1 2 
2,    6150, 2396.004,   0 0  1 2  2 1  0 0  0 0  2 4  0 3  2 0  0 0  0 1 
1,    4400, 4907.477,   2 4  1 4  0 0  0 1  2 2  0 4  1 1  2 2  2 0  1 2 
3,    8550, 6033.310,   3 1  2 1  0 2  0 0  0 0  0 4  2 4  0 2  0 0  0 2 

"Patch" includes values 1:6. 1:6 contain unequal number of rows.
My goal is to pull a random 25 rows from each Patch and compile in a new data frame; a data frame with 150 rows, 25 from each of 6 patches.

I am building off this post for extracting a random sample of rows with a conditional using 'sample' and 'which'.

sample = NULL
subset = NULL

for (patch in c(1:6))
{sample <- H0_LONG[ sample( which( H0_LONG$Patch == "[", patch, "]"), 25, replace = FALSE), ]

subset<- rbind(subset, sample)
rm(sample)
}

I'm a novice for-loop user, but I'm trying to make this work because I need to do this sampling for several hundred more data frames (read, more elegant suggestions welcome.) If I enter an integer in the 'which' function, it works fine, and the 'rbind' function seems to work fine - so I'm assuming there's something wrong with my execution of the for-loop.

Community
  • 1
  • 1
JMWD
  • 13
  • 3
  • Something like `df[unlist(tapply(rownames(df), df$Patch, sample, 25)),]` or a version which will work for both `data.frame`s and `matrix` objects: `df[unlist(tapply(seq_len(nrow(df)), df[,"Patch"], sample, 25)),]` – thelatemail Sep 01 '15 at 23:36

2 Answers2

3

The group_by and sample_n functions in the dplyr package let you do this easily:

library(dplyr)
subset <- H0_LONG %>%
    group_by(Patch) %>%
    sample_n(25)

This approach will typically also run faster than a for loop. Note that this code is just another way of writing:

subset <- sample_n(group_by(H0_LONG, Patch), 25)
David Robinson
  • 77,383
  • 16
  • 167
  • 187
2

I synthesized my own data here since you didn't provide your complete data:

set.seed(1);
counts <- 25:30;
H0_LONG <- data.frame(Patch=sample(rep(1:6,sample(counts))),Xmeters=4000L+sample(1:80,sum(counts),replace=T)*50L,Ymeters=runif(sum(counts),2000,7000),Genome=replicate(sum(counts),paste0(c('  ',' '),sample(0:4,20,replace=T),collapse='')),stringsAsFactors=F);
head(H0_LONG);
##   Patch Xmeters  Ymeters                                             Genome
## 1     6    7400 4212.962   3 3  3 2  4 2  4 3  2 4  2 0  3 3  0 3  2 1  3 4
## 2     4    7450 2783.571   3 1  0 1  4 4  3 4  4 3  0 1  1 0  0 1  1 0  3 1
## 3     4    5600 4911.026   4 3  1 0  0 4  1 2  3 4  2 0  4 3  1 0  0 4  4 3
## 4     1    5550 6850.811   0 1  3 4  3 1  3 1  4 0  2 1  0 4  2 3  2 4  3 1
## 5     2    7600 6947.499   4 3  0 2  2 0  2 4  3 2  1 3  3 4  3 2  2 1  2 4
## 6     2    6600 2882.260   4 0  3 4  4 1  1 4  0 4  1 2  2 3  2 0  1 3  0 4
nrow(H0_LONG);
## [1] 165
table(H0_LONG$Patch);
##
##  1  2  3  4  5  6
## 26 30 27 28 25 29

For a base R solution you can use by() to do the grouping:

res <- do.call(rbind,by(H0_LONG,H0_LONG$Patch,function(g) g[sample(seq_len(nrow(g)),min(nrow(g),25)),]));
head(res);
##       Patch Xmeters  Ymeters                                             Genome
## 1.134     1    5550 5451.284   0 4  3 0  0 0  2 3  4 4  0 0  1 3  3 3  0 2  3 1
## 1.112     1    7550 6712.527   3 0  2 0  4 0  3 4  3 1  3 0  0 0  1 2  3 1  0 2
## 1.96      1    4300 2362.303   0 4  3 0  2 0  3 3  0 4  4 2  3 2  1 3  3 4  3 0
## 1.137     1    7300 5562.701   3 0  1 3  4 0  3 3  4 4  0 1  4 1  2 2  2 4  2 2
## 1.4       1    5550 6850.811   0 1  3 4  3 1  3 1  4 0  2 1  0 4  2 3  2 4  3 1
## 1.86      1    5000 4573.663   4 0  2 4  2 2  1 2  0 3  4 0  2 0  3 2  1 3  1 2
nrow(res);
## [1] 150
table(res$Patch);
##
##  1  2  3  4  5  6
## 25 25 25 25 25 25

For the number of selections, I used min(nrow(g),25) instead of just 25 to handle patches with less than 25 rows in the input data.frame, if such were to ever exist.


I would also recommend looking into data.table:

library(data.table);
H0_LONG_dt <- as.data.table(H0_LONG);
H0_LONG_dt;
##      Patch Xmeters  Ymeters                                             Genome
##   1:     6    7400 4212.962   3 3  3 2  4 2  4 3  2 4  2 0  3 3  0 3  2 1  3 4
##   2:     4    7450 2783.571   3 1  0 1  4 4  3 4  4 3  0 1  1 0  0 1  1 0  3 1
##   3:     4    5600 4911.026   4 3  1 0  0 4  1 2  3 4  2 0  4 3  1 0  0 4  4 3
##   4:     1    5550 6850.811   0 1  3 4  3 1  3 1  4 0  2 1  0 4  2 3  2 4  3 1
##   5:     2    7600 6947.499   4 3  0 2  2 0  2 4  3 2  1 3  3 4  3 2  2 1  2 4
##  ---
## 161:     6    5200 4170.154   2 4  4 3  2 3  0 0  4 0  3 0  1 1  1 1  0 3  2 2
## 162:     1    5600 4585.049   4 1  3 4  1 0  3 2  0 4  3 4  4 2  4 1  0 2  1 1
## 163:     2    7250 6231.229   1 1  0 0  4 2  0 2  2 1  2 0  2 0  2 1  4 4  1 1
## 164:     3    4350 2275.821   4 1  4 4  4 0  0 2  1 3  2 1  0 4  0 3  4 2  0 0
## 165:     5    5500 4770.885   1 3  3 1  2 0  0 2  4 3  3 2  0 4  1 0  4 3  4 1
res2 <- H0_LONG_dt[,.SD[sample(seq_len(.N),min(.N,25))],Patch];
res2;
##      Patch Xmeters  Ymeters                                             Genome
##   1:     6    5500 2715.833   0 1  1 1  1 1  1 4  3 0  3 2  1 0  0 1  1 1  4 4
##   2:     6    7250 6695.684   2 3  1 2  3 4  0 1  0 4  0 0  3 4  2 1  1 3  1 2
##   3:     6    6900 3109.069   3 1  3 1  1 2  4 0  1 1  1 3  4 3  3 0  4 3  4 0
##   4:     6    7850 6892.770   2 1  3 4  1 2  1 0  3 2  1 0  3 4  0 3  0 1  3 1
##   5:     6    7850 2113.706   3 0  0 0  1 4  1 1  4 4  4 0  2 4  0 4  2 1  2 3
##  ---
## 146:     5    4700 6678.562   3 1  3 0  4 4  0 0  1 1  3 2  1 1  2 0  2 1  0 0
## 147:     5    4250 6008.439   1 4  0 0  3 0  2 0  3 1  1 4  0 2  4 1  4 0  1 2
## 148:     5    8000 6387.890   0 1  3 1  4 4  0 3  0 1  3 2  1 2  3 0  4 1  4 0
## 149:     5    4550 5738.175   4 2  0 0  3 0  0 3  2 2  0 1  3 0  1 0  3 4  1 1
## 150:     5    5950 6113.967   4 2  1 3  3 2  1 2  0 1  0 4  2 1  1 3  2 2  3 4
table(res2[,Patch]);
##
##  1  2  3  4  5  6
## 25 25 25 25 25 25
bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • Try to avoid `do.call(rbind,...)` whenever possible. It is comparatively slow as a wet week, and can be avoided by using `tapply` on row indexes as per my initial comment. Rough benchmarks over here with 1M rows and 10K groups reduces the time from 38 seconds to 0.3 secs when using `tapply` and subsetting. – thelatemail Sep 02 '15 at 01:49