2

There are 10 projects split between group A & B, each with different start and end dates. For each day within a given period the sum of outputX and outputY needs to be calculated. I manage to do this for all projects together, but how to split the results per group?

I've made several attempts with lapply() and purrr:map(), also looking at filters and splits, but to no avail. An example that doesn't distinguish between groups is found below.

library(tidyverse)
library(lubridate)

df <- data.frame(
  project = 1:10,
  group = c("A","B"),
  outputX = rnorm(2),
  outputY = rnorm(5),
  start_date = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
  end_date = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
df$interval <- interval(df$start_date, df$end_date)

period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1))

df_sum <- do.call(rbind, lapply(period$date, function(x){
  index <- x %within% df$interval;
  list("X" = sum(df$outputX[index]),
       "Y" = sum(df$outputY[index]))}))

outcome <- cbind(period, df_sum) %>% gather("id", "value", 2:3)

outcome

Ultimately, it should be a 40x4 table. Some suggestions are much appreciated!

MvB
  • 23
  • 3

1 Answers1

0

If I understand you correctly, you need to use inner join. SO can suggest us to use sqldf. See https://stackoverflow.com/a/11895368/9300556

With your data we can do smth like this. There is no need to calculate df$interval but we need to add ID to period, otherwise sqldf wont work.

df <- data.frame(
  project = 1:10,
  group = c("A","B"),
  outputX = rnorm(2),
  outputY = rnorm(5),
  start = sample(seq(as.Date('2018-01-3'), as.Date('2018-1-13'), by="day"), 10),
  end = sample(seq(as.Date('2018-01-13'), as.Date('2018-01-31'), by="day"), 10))
# df$interval <- interval(df$start_date, df$end_date)

period <- data.frame(date = seq(as.Date("2018-01-08"), as.Date("2018-01-17"), by = 1)) %>% 
  mutate(id = 1:nrow(.))

Then we can use sqldf

sqldf::sqldf("select * from period inner join df 
              on (period.date > df.start and period.date <= df.end) ") %>% 
  as_tibble() %>% 
  group_by(date, group) %>% 
  summarise(X = sum(outputX),
            Y = sum(outputY)) %>% 
  gather(id, value, -group, -date)
# A tibble: 40 x 4
# Groups:   date [10]
   date       group id    value
   <date>     <fct> <chr> <dbl>
 1 2018-01-08 A     X      3.04
 2 2018-01-08 B     X      2.34
 3 2018-01-09 A     X      3.04
 4 2018-01-09 B     X      3.51
 5 2018-01-10 A     X      3.04
 6 2018-01-10 B     X      4.68
 7 2018-01-11 A     X      4.05
 8 2018-01-11 B     X      4.68
 9 2018-01-12 A     X      4.05
10 2018-01-12 B     X      5.84
# ... with 30 more rows
atsyplenkov
  • 1,158
  • 13
  • 25