How can I draw n rows from a group where each group has a different number of rows?

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

I've tried,

library(dplyr)
outdat <- df %>% 
  group_by(color) %>% 
  sample_n(nrow(.), replace = TRUE)
outdat

but here nrow(.) is evaluated on the whole df (40 rows), not on each group's subset, so every group is sampled at the full size.
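A quick check of that behaviour, offered as a sketch: the data frame `toy` and its group sizes are made up for illustration. Inside a pipe, `.` refers to the whole data frame passed in from the left-hand side, while `n()` is evaluated per group.

```r
library(dplyr)

set.seed(1)
# Hypothetical data with unequal group sizes (5, 10 and 25 rows)
toy <- data.frame(
  x = rnorm(40),
  color = rep(c("blue", "red", "yellow"), times = c(5, 10, 25))
)

sizes <- toy %>%
  group_by(color) %>%
  # nrow(.) sees the whole piped data frame; n() sees the current group
  summarise(whole = nrow(.), per_group = n())
sizes
```

The `whole` column is the same for every group, whereas `per_group` differs, which is why `nrow(.)` cannot serve as a per-group sample size.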

This SO post is close, but defines a specific number of row draws. I need it to be specific to group within dplyr.

www
Vedda
  • It's not clear to me how many rows you want to sample from each group, how this relates to the original number of rows per group. – Stuart Allen Dec 05 '17 at 02:42
  • how many rows you want to sample from your `df`? if you want 10 rows, you can use `sample_n(df, 10)` – myincas Dec 05 '17 at 02:46
  • @Snubian I want to sample the number of rows from the grouped data. – Vedda Dec 05 '17 at 02:48
  • @mt1022 I Tried `n()` and it can't be used directly. `Error: This function should not be called directly` – Vedda Dec 05 '17 at 02:49
  • @myincas I want to sample the number of rows defined in each group, so I don't want to specify exact samples. – Vedda Dec 05 '17 at 02:50
  • @RonakShah Think of a subset of data and the number of rows in that subset. I need it defined by each group because they vary across groups. – Vedda Dec 05 '17 at 02:52
  • @RonakShah The output data should have the same dimensions as the original data, but the observations may be different because of sampling with replacement. – Vedda Dec 05 '17 at 02:55

3 Answers

Another workaround is to use sample_frac with a fraction of 1, so that each group is resampled at its own size:

outdat <- df %>%
    group_by(color) %>%
    sample_frac(1, replace = TRUE)
outdat
# # A tibble: 40 x 3
# # Groups:   color [4]
#             X1          X2 color
#          <dbl>       <dbl> <chr>
#  1  0.69256186  0.97180252  blue
#  2  1.54384827 -0.20268802  blue
#  3 -1.20068240 -0.45402013  blue
#  4  2.63407877 -0.31644247  blue
#  5  1.20716737 -0.91380874  blue
#  6  0.01067475  1.02004679  blue
#  7  0.01067475  1.02004679  blue
#  8  1.79732108 -0.04072946  blue
#  9  0.01067475  1.02004679  blue
# 10  1.79732108 -0.04072946  blue
# # ... with 30 more rows

Additionally, use outdat %>% ungroup() to remove grouping.
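For readers on dplyr 1.0.0 or later, where sample_frac is superseded, the same idea can be sketched with slice_sample(prop = 1, replace = TRUE). The data frame `toy` below is a made-up example with unequal group sizes, not the data from the question.

```r
library(dplyr)

set.seed(42)
# Hypothetical data: groups of 5, 10 and 25 rows
toy <- data.frame(
  x = rnorm(40),
  color = rep(c("blue", "red", "yellow"), times = c(5, 10, 25))
)

boot <- toy %>%
  group_by(color) %>%
  # prop = 1 keeps each group's own size; replace = TRUE allows repeated rows
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()

# Each group should keep its original number of rows
count(boot, color)
```

The call to count() is only a sanity check that the resampled data has the same per-group dimensions as the input.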

mt1022
  • This is more what I'm looking for because it can handle multiple groups. Thanks! – Vedda Dec 05 '17 at 03:02
  • I like this. This solution does not need to load the `purrr` package as my post does. I think it is the better solution. – www Dec 05 '17 at 03:03

Another solution using slice and sample.int. Reusing data from www:

outdat <- df %>%
  group_by(color) %>%
  slice(sample.int(n(), replace = TRUE))
outdat

            X1          X2  color
1   1.71506499 -1.12310858   blue
2   0.07050839  2.16895597   blue
3   0.46091621 -0.40288484   blue
4   0.07050839  2.16895597   blue
5   0.07050839  2.16895597   blue
6   1.71506499 -1.12310858   blue
7  -1.26506123 -0.46665535   blue
8   1.55870831 -1.26539635   blue
9   0.12928774  1.20796200   blue
10  1.55870831 -1.26539635   blue
11  0.55391765 -0.28477301   pink
12 -0.29507148 -2.30916888   pink
13 -0.30596266  0.18130348   pink
14 -0.06191171 -1.22071771   pink
15  0.55391765 -0.28477301   pink
16  0.55391765 -0.28477301   pink
17  0.87813349 -0.70920076   pink
18  0.68864025  1.02557137   pink
19 -0.30596266  0.18130348   pink
20  0.68864025  1.02557137   pink
21  0.70135590  0.12385424    red
22  0.11068272  1.36860228    red
23 -1.96661716  0.58461375    red
24  0.40077145 -0.04287046    red
25  1.78691314  1.51647060    red
26 -0.55584113 -0.22577099    red
27  0.40077145 -0.04287046    red
28  1.78691314  1.51647060    red
29 -0.47279141  0.21594157    red
30 -0.47279141  0.21594157    red
31 -1.02600445 -0.33320738 yellow
32 -0.72889123 -1.01857538 yellow
33  1.25381492  2.05008469 yellow
34  0.83778704  0.44820978 yellow
35  1.25381492  2.05008469 yellow
36 -0.62503927 -1.07179123 yellow
37 -0.62503927 -1.07179123 yellow
38  0.83778704  0.44820978 yellow
39 -0.21797491 -0.50232345 yellow
40 -1.68669331  0.30352864 yellow
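A short sketch of why the slice approach respects group sizes, using a made-up data frame `toy`: within each group, n() is that group's row count, sample.int draws that many positions with replacement, and slice picks rows by position relative to the group.

```r
library(dplyr)

set.seed(7)
# sample.int(k, replace = TRUE) draws k positions from 1..k, repeats allowed
idx <- sample.int(5, replace = TRUE)
idx

# Hypothetical data: 5 "blue" rows (id 1-5) and 10 "red" rows (id 6-15)
toy <- data.frame(
  id = 1:15,
  color = rep(c("blue", "red"), times = c(5, 10))
)

resampled <- toy %>%
  group_by(color) %>%
  # positions are relative to the current group, so rows never cross groups
  slice(sample.int(n(), replace = TRUE)) %>%
  ungroup()

table(resampled$color)
```

Because the positions are interpreted per group, a "blue" draw can only return ids 1 to 5, and each group comes back with its original number of rows.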
Lamia

A workaround using the purrr package. It seems that the sample_n function cannot take n() as the size argument, probably because that argument does not take vectorized input. However, if we split the data frame by color, we can apply sample_n with size = nrow(.) to each piece, where . is now the data frame for a single group.

# Set seed for reproducibility
set.seed(123)

# Create example data frame
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

# Load packages
library(dplyr)
library(purrr)

outdat <- df %>%
  # Split the data frame by color
  split(.$color) %>%
  # Apply the sample_n function to all data frames
  map_dfr(~sample_n(., size = nrow(.), replace = TRUE))

outdat
#             X1          X2  color
# 1   1.71506499 -1.12310858   blue
# 2   0.07050839  2.16895597   blue
# 3   0.46091621 -0.40288484   blue
# 4   0.07050839  2.16895597   blue
# 5   0.07050839  2.16895597   blue
# 6   1.71506499 -1.12310858   blue
# 7  -1.26506123 -0.46665535   blue
# 8   1.55870831 -1.26539635   blue
# 9   0.12928774  1.20796200   blue
# 10  1.55870831 -1.26539635   blue
# 11  0.55391765 -0.28477301   pink
# 12 -0.29507148 -2.30916888   pink
# 13 -0.30596266  0.18130348   pink
# 14 -0.06191171 -1.22071771   pink
# 15  0.55391765 -0.28477301   pink
# 16  0.55391765 -0.28477301   pink
# 17  0.87813349 -0.70920076   pink
# 18  0.68864025  1.02557137   pink
# 19 -0.30596266  0.18130348   pink
# 20  0.68864025  1.02557137   pink
# 21  0.70135590  0.12385424    red
# 22  0.11068272  1.36860228    red
# 23 -1.96661716  0.58461375    red
# 24  0.40077145 -0.04287046    red
# 25  1.78691314  1.51647060    red
# 26 -0.55584113 -0.22577099    red
# 27  0.40077145 -0.04287046    red
# 28  1.78691314  1.51647060    red
# 29 -0.47279141  0.21594157    red
# 30 -0.47279141  0.21594157    red
# 31 -1.02600445 -0.33320738 yellow
# 32 -0.72889123 -1.01857538 yellow
# 33  1.25381492  2.05008469 yellow
# 34  0.83778704  0.44820978 yellow
# 35  1.25381492  2.05008469 yellow
# 36 -0.62503927 -1.07179123 yellow
# 37 -0.62503927 -1.07179123 yellow
# 38  0.83778704  0.44820978 yellow
# 39 -0.21797491 -0.50232345 yellow
# 40 -1.68669331  0.30352864 yellow
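The same split-and-sample idea can be extended to more than one grouping factor by passing a list of factors to split. This is only a sketch: the data frame `toy` and the `shape` column are hypothetical and not part of the original question.

```r
library(dplyr)
library(purrr)

set.seed(123)
# Hypothetical data with two factors and unequal cell sizes:
# blue/circle = 1, blue/square = 3, red/circle = 3, red/square = 5
toy <- data.frame(
  x = rnorm(12),
  color = rep(c("blue", "red"), times = c(4, 8)),
  shape = c("circle", "square", "square", "square",
            rep(c("circle", "square"), times = c(3, 5)))
)

outdat2 <- toy %>%
  # Split by every color/shape combination, dropping empty combinations
  split(list(.$color, .$shape), drop = TRUE) %>%
  # Resample each piece with replacement at its own size
  map_dfr(~ sample_n(.x, size = nrow(.x), replace = TRUE))

count(outdat2, color, shape)
```

With drop = TRUE, combinations that do not occur in the data are skipped, so sample_n is never called on an empty data frame.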
www
  • That's easy enough. Thanks! It is surprising to me `dplyr` can't handle this. Is there any way to incorporate multiple factors? – Vedda Dec 05 '17 at 02:57
  • Thanks. I feel the same. My first thought to this task is also put `n()` to the `size` argument, but it just returns `Error: This function should not be called directly`. – www Dec 05 '17 at 02:59