1

How would you solve the following problem in R / tidyverse?:

sample data:

tibble(
  date = seq(as.Date(paste0("2010-01-",runif(1,1,25))), by = "month", length.out = 24),
  machine_ID = sample(letters[1:10],size = 24,replace = T),
  machine_cat = rep(c(1,2),12)
)

objective:

Add a column called last6m, which counts the number of unique machine_IDs observed in the last 6 months, within the associated machine_cat.

tidyverse and no looping is prefered (purrr is ok).

Appriciate if anyone would take a quick look! Thanks in advance :-)

MariusJ
  • 71
  • 6
  • 2
    What is `machine_cat`? It doesn't appear in your example. – Allan Cameron Dec 05 '22 at 14:24
  • 4
    You can probably adapt this 30day version to 6month: https://stackoverflow.com/questions/71159724/count-observations-over-rolling-30-day-window – MrFlick Dec 05 '22 at 14:27
  • Thank you for your replies! Forgot to change variable name, should be right now. – MariusJ Dec 05 '22 at 14:31
  • 2
    This can be solved using a range-based join, as shown in [G. Grothendieck's answer](https://stackoverflow.com/a/71160265/3358272) on the question linked to by MrFlick. Looking at it that way, you can do it _currently_ in SQL (including `sqldf`), in `data.table`, and using `fuzzyjoin`; c.f., https://stackoverflow.com/q/64283351/3358272, https://stackoverflow.com/q/70584557/3358272, https://stackoverflow.com/q/64539945/3358272. **However**, `dplyr` will support it natively when version 1.1.0 is released, see https://www.tidyverse.org/blog/2022/11/dplyr-1-1-0-is-coming-soon/#join-improvements. – r2evans Dec 05 '22 at 15:03
  • 2
    Please check MrFlick's link and my other links to adapt to your data. Report back if you continue to have problems, otherwise this is a dupe question. Thanks! – r2evans Dec 05 '22 at 15:04
  • Thanks! I like the join solution, fast and simple. – MariusJ Dec 06 '22 at 09:57

1 Answers1

2

The following solution was obtained based on the post suggested by MrFlick and r2evans: G. Grothendieck's answer

library(tidyverse)
library(lubridate)
library(sqldf)

data <- tibble(
  date = seq(as.Date(paste0("2010-01-",runif(1,1,25))), by = "month", length.out = 24),
  machine_ID = sample(letters[1:10],size = 24,replace = T),
  machine_cat = rep(c(1,2),12)
)

sqldf("
  SELECT a.*, COUNT(distinct(b.machine_ID)) AS last6m
  FROM data a
  LEFT JOIN data b
  ON a.machine_cat = b.machine_cat
  AND (b.date between a.date - 180 AND a.date)
  GROUP BY a.rowid
") %>% arrange(machine_cat,date)
MariusJ
  • 71
  • 6
  • Add `order by machine_cat, date` to the end of the sql statement. Then we can remove `%>% arrange(...)` Also lubridate is not used anywhere in the answer so there should not be a `library` statement for it. – G. Grothendieck Dec 06 '22 at 12:57