1

Consider the following DataFrame:

    DF = structure(list(c_number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L), date = c("2001-01-06", "2001-01-07", "2001-01-08", 
"2001-01-09", "2001-01-10", "2001-01-11", "2001-01-12", "2001-01-13", 
"2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", "2001-01-18", 
"2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", "2001-01-23", 
"2001-01-24", "2001-01-25", "2001-01-26", "2001-01-11", "2001-01-12", 
"2001-01-13", "2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", 
"2001-01-18", "2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", 
"2001-01-23", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27", 
"2001-01-28", "2001-01-12", "2001-01-13", "2001-01-14", "2001-01-15", 
"2001-01-16", "2001-01-17", "2001-01-18", "2001-01-19", "2001-01-20", 
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25", 
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30", 
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25", 
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30", 
"2001-01-31", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27", 
"2001-01-28", "2001-01-29", "2001-01-30", "2001-01-31", "2001-02-01"
), value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("c_number", 
"date", "value"), row.names = c(NA, -78L), class = "data.frame")

I have sales data for 5 customer on consecutive dates; For customer 1, I have sales data on 21 consecutive dates....for customer # 5, I have sales data on 9 consecutive dates...:

> table(DF[, 1])

 1  2  3  4  5 
21 18 19 11  9

For each customer I want to sample a sub DF of 15 consecutive days (If I have at least 15 consecutive dates for that customer) or all dates for that customer (if I don't have 15 consecutive dates for that customer).

The key part is that in case 1 (If I have at least 15 consecutive dates for that customer) those 15 consecutive days should have a random start date (e.g. not always be the first or last 15 dates for an customer) to avoid introducing a bias in the analysis.

In plain R I would do:

library(dplyr)

slow_function <- function(i, DF, length_out = 15){
  sub_DF = DF[DF$c_number == i, ]
  if(nrow(sub_DF) <= length_out){
    out_DF = sub_DF
  } else {
    random_start = sample.int(nrow(sub_DF) - length_out, 1)
    out_DF = sub_DF[random_start:(random_start + length_out - 1), ]
  }
}
a_out = lapply(1:nrow(a_1), slow_function, DF = DF, length_out = 15)
a_out = dplyr::bind_rows(a_out)


table(a_out[, 1])
 1  2  3  4  5 
15 15 15 11  9 

But my data is much larger and the operation above unbearably slow. Is there a fast way to obtain the same result in data.table/dplyr?

Edit: code to generate the data.

num_customer = 10
m   = 2 * num_customer
a_0 = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
a_1 = matrix(sort(sample(as.character(a_0), m)), nc = 2)
a_2 = list()
for(i in 1:nrow(a_1)){
  a_3 = seq(as.Date(a_1[i, 1]), as.Date(a_1[i, 2]), by = "day")
  a_4 = data.frame(i, as.character(a_3), round(runif(length(a_3), 1)))
  colnames(a_4) = c("c_number", "date", "value")
  a_2[[i]] = a_4
}
DF = dplyr::bind_rows(a_2)
dim(DF)
table(DF[, 1])
dput(DF)

Edit2:

on a 100k customer DF, Christoph Wolk's solution is the fastest. Next is G. Grothendieck's (about 4 times more time), next is Nathan Werth's (another factor of 2 slower than G. Grothendieck's). The other solutions are noticeably slower. Still, all proposals are faster than my tentative 'slow_function' so thanks to everyone!

user189035
  • 5,589
  • 13
  • 52
  • 112
  • 1
    Question is a little unclear. For each employee, you want to choose a random starting date and sample a maximum of 15 consecutive days after that starting point? Or, if the random choice would result in less than 15 data points for the employee, just take the last 15? – jdobres Sep 01 '17 at 19:51
  • @jdobres: thanks for asking. Actually the second interpretation ('if the random choice would result in less than 15 data points for the employee, just take the last 15?') is what I want. – user189035 Sep 01 '17 at 19:55

5 Answers5

2

A way to speed up in base R might be to just work with indices rather than the whole data.frame before subsetting.

output = DF[unlist(lapply(
            split(1:NROW(DF), DF$c_number),  #Split indices along rows of DF
            function(x){
                if(length(x) < 15){          #Grab all indices if there are less than 15
                    x
                } else{
                    #Grab an index randomly such that there will be 14 more left after it
                    x[sample(0:(length(x) - 15), 1) + sequence(15)]
                }
            })),
            ]

sapply(split(output, output$c_number), NROW)
# 1  2  3  4  5 
#15 15 15 11  9 
d.b
  • 32,245
  • 6
  • 36
  • 77
  • How is this using data.table/dplyr to speed things up? Or do you mean this problem can't be sped up using data.table/dplyr? – user189035 Sep 01 '17 at 19:51
  • `lapply` is more or less a wrapper for a standard R loop, which is slow. The dplyr and data.table packages utilize a lot of faster C code to speed up basic operations. – jdobres Sep 01 '17 at 19:54
  • 1
    @jdobres - `lapply` uses a C loop. The last line of the `lapply` source shows it being dispatched into the internal C code. `.Internal(lapply(X, FUN))` It is not slow when used properly. – Rich Scriven Sep 01 '17 at 20:02
  • @RichScriven Huh. Didn't know that. Thanks! – jdobres Sep 01 '17 at 20:04
  • 1
    Thanks. I did not know that about lapply either. I find G. Grothendieck's answer a bit more intuitive (but this is subjective), but thanks! – user189035 Sep 01 '17 at 20:10
2

Try this:

sample15consecutive <- function(DF) {
runs <- rle(DF$c_number)$lengths
start <- ifelse(runs > 15, sapply(pmax(runs-15, 1), sample.int, size=1), 1)
end <- ifelse(runs >= 15, 15, runs)
previous <- cumsum(c(0, head(runs, -1)))
DF[unlist(mapply(seq, previous + start, previous + start + end - 1), length),]
}

It's about 4 times faster according to microbenchmark. The c_numbers and dates have to be sorted.

Christoph Wolk
  • 1,748
  • 1
  • 7
  • 12
  • Thanks (+1)! But wouldn't G. Grothendieck's answer be faster (much less overheads as he calls fewer O(n) functions)? – user189035 Sep 01 '17 at 20:16
  • 1
    I can't say. In my microbenchmark, my version seems a little faster, but I might have made a mistake, or the behavior might depend on the specifics of the data - on the sample dataset all method are really fast. You could test it on your large data set. – Christoph Wolk Sep 01 '17 at 20:32
  • Indeed, I tried again and I get a 4 times improvement on a larger data set. Thanks again for pointing the perf out! – user189035 Sep 01 '17 at 21:04
1

This is pretty straightforward with the tidyverse packages (specifically, dplyr and tidyr).

library(tidyverse)

df.sample <- arrange(DF, date) %>% 
  group_by(c_number) %>% 
  do(head(., 15))

Output (first 30 rows / 2 employees):

# A tibble: 65 x 3
   c_number       date value
      <int>      <chr> <dbl>
 1        1 2001-01-06     1
 2        1 2001-01-07     1
 3        1 2001-01-08     1
 4        1 2001-01-09     1
 5        1 2001-01-10     1
 6        1 2001-01-11     1
 7        1 2001-01-12     1
 8        1 2001-01-13     1
 9        1 2001-01-14     1
10        1 2001-01-15     1
11        1 2001-01-16     1
12        1 2001-01-17     1
13        1 2001-01-18     1
14        1 2001-01-19     1
15        1 2001-01-20     1
16        2 2001-01-11     1
17        2 2001-01-12     1
18        2 2001-01-13     1
19        2 2001-01-14     1
20        2 2001-01-15     1
21        2 2001-01-16     1
22        2 2001-01-17     1
23        2 2001-01-18     1
24        2 2001-01-19     1
25        2 2001-01-20     1
26        2 2001-01-21     1
27        2 2001-01-22     1
28        2 2001-01-23     1
29        2 2001-01-24     1
30        2 2001-01-25     1
# ... with 35 more rows

Edit: the following selects a random start date for each employee and then selects up to 15 consecutive days after the randomly chosen point:

df.sample <- arrange(DF, date) %>% 
  group_by(c_number) %>% 
  mutate(date = as.Date(date), start = sample(date, 1)) %>% 
  filter(date >= start & date <= (start + 14))
jdobres
  • 11,339
  • 1
  • 17
  • 37
  • Not the same, because you alway take the last 15 rows for each customer. But I want 30 consecutive rows from a random start date. I reworded the title to make this more clear (mea maxima culpa) – user189035 Sep 01 '17 at 19:52
  • Updated to better match your intent. Can't guarantee that it's the fastest, but it's very readable. – jdobres Sep 01 '17 at 20:05
  • this (after the edit) is exactly the answer to my question (+1). But d.b and G. Grothendieck's answers are showing I was not asking the correct question. – user189035 Sep 01 '17 at 20:13
1

samp generates a vector of 1 (in sample) and 0 (out of sample) and we subset by that. I haven't benchmarked it but it does not break up DF into sub-dataframes but only splits the c_number vector and then does a single subset on the original DF.

samp <- function(x) {
  n <- length(x)
  replace(0*x, seq(sample(max(n - 15, 1), 1), length = min(n, 15)), 1)
}
s <- subset(DF, ave(c_number, c_number, FUN = samp) == 1)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

Try this:

library(data.table)

setDT(DF)

DF[
  ,
  {
    if (.N <= 15) {
      # 15 or fewer rows? Grab them all.
      .SD
    } else {
      # Grab a random starting row not too close to the end
      random_start <- sample(seq_len(.N - 14), size = 1)
      .SD[random_start + 0:14]
    }
  },
  by = c_number
]
Nathan Werth
  • 5,093
  • 18
  • 25
  • this is exactly the answer to my question (+1). But d.b and G. Grothendieck's answers are showing I was not asking the correct question. – user189035 Sep 01 '17 at 20:08