Count the number of rows within a certain time range based on each row in R (tidyverse)

Question

I want to count the number of rows within a certain time range based on each row after grouping by id. For instance, let us say a 1-month window around each datetime entry in the column "cleaned_date".

head(data$cleaned_date)

[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"

The dates are in POSIXct format.

For the first entry, I need to count the number of rows within the time from 2004-09-11 to 2004-11-11, for the second entry, count the number of rows within the time from 2008-08-10 to 2008-10-10, so on and so forth.

I used roughly the following code

data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())

But it does not seem to work and I got counts as an empty column. Any help would be appreciated, thanks!

A reproducible example can be the following:

The input is

  cleaned_date id
1   2008-09-11  A
2   2008-09-10  B
3   2008-09-30  B
4   2011-10-25  A
5   2011-11-14  A

And I want the output to be

  cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

For the first entry, I want to count the rows in the timeframe 2008-08-11 to 2008-10-11, the second entry seems to satisfy but we need to group by "id", so it does not count. For the second entry I want to count the rows in the timeframe 2008-08-10 to 2008-10-10, rows 2 and 3 satisfy, so the counts is 2. For the third entry I want to count the rows in the timeframe 2008-08-30 to 2008-10-30, rows 2 and 3 satisfy again, so on and so forth.

Note that the actual dataset I would like to operate on has millions of rows, so it might be more efficient to use tidyverse rather than base R.

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Apr 22 '21 at 16:32

Snipeskies · Answer 1 · 2021-04-23T17:53:19.920

Perhaps not the most elegant solution.

# input data. Dates as character vector
input = data.frame(
    cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"), 
    id = c("A", "B", "B", "A", "A")
    )

# function to create a date window n months around specified date
window <- function(x, n = 1){
    x <- rep(as.POSIXlt(x),2)
    x[1]$mon <- x[1]$mon - n
    x[2]$mon <- x[2]$mon + n
    return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}

# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
    length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
    }))

input

  cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

Edit for large datasets:

# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
    cleaned_date = dates,
    id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")

# optional, sort data by date for small boost in performance
input <- input[order(input$cleaned_date),]

counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
    #pb$tick()
    w <- window(date)
    tmp <- input[which(input$cleaned_date %in% w),]
    tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
        length(which(tmp$id == tmp$id[x]))
    }))
    counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts 
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

Time difference of 3.247204 mins

If you want to go faster, you can run the loop in parallel

library(foreach)
library(doParallel)

cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)

dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
    w <- window(dates[i])
    tmp <- input[which(input$cleaned_date %in% w),]
    tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
        length(which(tmp$id == tmp$id[x]))
    }))
    tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

Time difference of 37.37211 secs

Note, I'm running this on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 and 16 GB of RAM.

Welcome to Stackoverflow. Enjoyed your answer and playing around with it at home, and always good to see `base` at work. The `window` function on an additional line, `> input[nrow(input)+1, ] <- NA` with 2011 date and B id really made it clear for me. — Chris, Apr 22 '21 at 23:57
Thank you so much! It worked for the example, but when I applied to my actual dataset, I got an error "Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'which': error in evaluating the argument 'x' in selecting a method for function 'format': 'to' must be a finite number Called from: h(simpleError(msg, call))" — TwoAlpha, Apr 23 '21 at 01:47
It might be because my dataset has too many rows (around 2 - 3 million rows) and base R might not be efficient enough. — TwoAlpha, Apr 23 '21 at 01:48
Thank you so much! Just to confirm, for "w <- window(date)", the window function is just what you defined earlier before editing? — TwoAlpha, Apr 25 '21 at 21:29
@Snipeskies, I used the window function you wrote before editing and got an error "Error during wrapup: 'origin' must be supplied Error: no more error handlers available (recursive errors?); invoking 'abort' restart" and it further pointed me to "Error in as.POSIXlt.numeric(x) : 'origin' must be supplied Called from: as.POSIXlt.numeric(x)". But if I ran your edited code directly without defining the window function, it gave me values for counts without any error. I know there is also a window function in R, is that what you were referring to or did you mean the window function you defined? — TwoAlpha, Apr 25 '21 at 21:55
@TwoAlpha, sorry for not being explicit, the window function I use here is the one I previously defined. I forgot there was a native R window function, you should probably rename the window function defined above to avoid conflicts. Also you might want to check the custom window function to make sure you are getting the right range. I seem to recall the last day in the rage was not included in seq(), so you need to add the last day to the vector. Should be straightforward. — Snipeskies, Apr 26 '21 at 16:35

score 0 · Answer 2 · answered Apr 22 '21 at 17:50

It is still hard to determine exactly what you're trying to accomplish, but this will at least get you counts for a specified date range:

df %>% 
  group_by(id) %>% 
  filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>% 
  mutate(counts = n())

Will give us:

  cleaned_date id    counts
  <date>       <chr>  <int>
1 2008-09-11   A          1
2 2008-09-10   B          2
3 2008-09-30   B          2

Count the number of rows within a certain time range based on each row in R (tidyverse)

2 Answers2