Random sample by group and filtering on the basis of result

Question

I have a dataframe that is generated by the following code

l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)

dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)), 
c("ids", "months", "calWeek", "value"))

and looks like this:

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
| 1  |     5 |   201710 |     6 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 | 201742   |    89 |
| 2  |    88 | 201743   |    90 |
|  2 |    88 |   201744 |    51 |
+----+-------+----------+-------+

I would like to randomly sample a calendar week from each id-month group (the months are not calendar months). Then I would like to keep all id-month combination prior to the sample months.

An example output could be: suppose the sampling output returned cal week 201743 for the group id=2 and month=88 and 201709 for the group id=1 and month=5, then the final ouput should be

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 | 201742   |    89 |
   2  |    88 | 201743   |    90 |

+----+-------+----------+-------+

I tried to work with dplyr's sample_n function (which is going to give me the random calendar week by id-month group, but then I do not know how to get all calendar weeks prior to that date. Can you help me with this. If possible, I would like to work with dplyr.

Please let me know in case you need further information.

Many thanks

OK, forget my just deleted comment. I think your question looks interesting, but we would really appreciate a sample data frame... (you should know that :) — tjebo, Mar 22 '18 at 08:51
As an idea. By sampling you will get a vector of your sampled calendar weeks. Now you can use this vector in order to define maxima in the cal week column of your original data frame and filter out those which are higher (or lower). — tjebo, Mar 22 '18 at 08:54
Hi Tjebo, thanks I added some code. Can you explain a bit more how to find the maxima in the cal week. Thx. Also, the dataframe is somewhat large (about 12 000 000 rows). I am actually having some performance problems with just creating the random sample (i.e. with sample_n). — clog14, Mar 22 '18 at 08:59
To be clear, you only want to sample from one specific Cal Week, and keep the id-month combination from the row before this Cal Week ? — Colin FAY, Mar 22 '18 at 09:00
Hi Colin, I want to sample a calendar week within id-month group and then keep all calendar weeks below the sampled one. — clog14, Mar 22 '18 at 09:01
wow, I had to learn how to sample from a group first. Well done... Would have been good to show this in your code, how you have used sample_n... https://stackoverflow.com/questions/21255366/sample-rows-of-subgroups-from-dataframe-with-dplyr — tjebo, Mar 22 '18 at 09:12
as to your problem with very large data - in the question which I was referring to, people are very keen on data table. This might be the solution for your problem - I don't work with data table though — tjebo, Mar 22 '18 at 09:51

score 1 · Answer 1 · answered Mar 22 '18 at 09:13

1

That should do the trick:

sample_and_get_below <- function(df, when, size){
  res <- filter(df, calWeek == when) %>%
    sample_n(size)
  filter(df, calWeek > when) %>%
    rbind(res, .)
}
sample_and_get_below(dat, 201741, 1)
  ids months calWeek value
1   2     88  201741    99
2   2     88  201742   100
3   2     88  201743  1001
4   2     88  201744  1002

answered Mar 22 '18 at 09:13

Colin FAY

4,849
1
12
29

Hi, thanks but this looks incorrect. I would expect boths ids in the result. Then the sampling would return for instance for id2 and month 88 the cal week 201742, then i would want the rows corresponding to id2 and month 88 and calWeek 201741 and 201742. – clog14 Mar 22 '18 at 09:18
Sorry I have a hard time understanding what you want to get as a result. Could you please provide an example output in the question? – Colin FAY Mar 22 '18 at 09:21

tjebo · Accepted Answer · 2018-03-22T09:47:43.060

1

require(dplyr) 
set.seed(1)     # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1)) 

sampled_day <- sampled$calWeek

dat %>% group_by(ids) %>% 
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)

#You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)

# A tibble: 3 x 4
# Groups:   ids [2]
    ids months calWeek  value
  <dbl>  <dbl>   <dbl>  <dbl>
1  1.00   5.00  201708   5.00
2  2.00  88.0   201741  99.0 
3  2.00  88.0   201742 100

This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=

edited Mar 22 '18 at 09:47

answered Mar 22 '18 at 09:39

tjebo

21,977
7
58
94

Hi Tjebo, thanks. I would actually have to group on ids and months (because of how the real dataframe is structured, but I suppose this does not matter much). Additionally, I noticed that this code takes very long on my 12 million rows dataframe. I suppose this is from the sampling, and there is not much one can do? – clog14 Mar 22 '18 at 11:18
as per comment above, you might want to consider data table ? – tjebo Mar 22 '18 at 11:24

Random sample by group and filtering on the basis of result

2 Answers2