0

I have two columns with start and end dates of every week. I need to aggregate other column on monthly basis by the mean of the weeks of particular month (I have 3 years in dataset) and create another column that will contain weight for the whole month (so it will be the same value for 5-6 weeks, depending how many weeks particular month have for particular ID (I have thousands of id's in dataset). Tricky part is that some of the weeks are overlapping, so that one row sometimes but be taken into calculation of both months eg. when we have start_date = 2020-07-27 and end_date = 2020-08-09 (It has to be taken both to July and August month). This is my data:

ID weight start_date end_date
60 1,2 2019-12-30 2020-01-05
60 1,4 2020-01-06 2020-01-12
60 1,3 2020-01-13 2020-01-19
60 1,0 2020-01-20 2020-01-26
60 3,8 2020-01-27 2020-02-02
61 1,7 2019-12-30 2020-01-05
61 12,9 2020-01-06 2020-01-12

I want to obtain:

ID weight start_date end_date Monthy_weight Month
60 1,2 2020-12-30 2020-01-05 1,74 01.2020
60 1,4 2020-01-06 2020-01-12 1,74 01.2020
60 1,3 2020-01-13 2020-01-19 1,74 01.2020
60 1,0 2020-01-20 2020-01-26 1,74 01.2020
60 3,8 2020-01-27 2020-02-02 1,74 01.2020
61 1,7 2020-12-30 2020-01-05 7,3 01.2020
61 12,9 2020-01-06 2020-01-12 7,3 01.2020

Firstly I wanted to do a loop that will detect every month in both columns and if the month appears, it will take the mean from other column, but then I found similar problem on stack overflow (How to convert weekly data into monthly data?) and decided to do it with zoo.

I tried to implement solution from the above post:

library(zoo)
z.st <- read.zoo(long_weights[c("start_date", "weight")])
z.en <- read.zoo(long_weights[c("end_date", "weight")])
z <- c(z.st, z.en)

g <- zoo(, seq(start(z), end(z), "day"))
m <- na.locf(merge(z, g))
aggregate(m, as.yearmon, mean)

but after this line:

z <- c(z.st, z.en)

Im obtaining an error: Error in bind.zoo(...) : indexes overlap

I also tried, but this not takes into consideration overlapping weeks:

df <- df %>% group_by(HHKEY, month = floor_date((as.Date(end_date)- as.Date(start_date))/2 + as.Date(start_date), "month")) %>% mutate(monthly_weight = mean(weight), .after = end_date, month = format(month, "%Y.%m")) %>% ungroup()
Fendi
  • 83
  • 6
  • (1) Starting on `2020-12-30` and ending on `2020-01-05` is backwards to me, should that be `2019-12-30` instead? (2) Are you saying that your monthly average of `weight` for row 1 occurs at some point in `2020-01`, then it takes all weights from that month and averages them? What happens to the average of weights in the other month? That first row could have a December average as well. – r2evans Jan 25 '23 at 13:21
  • Yeah, that is a mistake I made when writing this table here. In the data I have 2019-12-30, will edit the table here. – Fendi Jan 25 '23 at 13:23

1 Answers1

0

A possible solution may be to get the start_date per month when they differ (at the end of a month) as end date for the grouping variable month. Extended the data to include a year change within an ID.

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(start_date = as.Date(start_date), end_date = as.Date(end_date), 
         month = lead(format(start_date, "%m.%Y")), 
         month = if_else(is.na(month), 
           format(start_date, "%m.%Y"), format(end_date, "%m.%Y"))) %>% 
  group_by(ID, month) %>% 
  mutate(monthly_weight = mean(weight), .before=month) %>% 
  ungroup()
# A tibble: 14 × 6
      ID weight start_date end_date   monthly_weight month  
   <dbl>  <dbl> <date>     <date>              <dbl> <chr>  
 1    60    1.2 2019-12-30 2020-01-05           1.74 01.2020
 2    60    1.4 2020-01-06 2020-01-12           1.74 01.2020
 3    60    1.3 2020-01-13 2020-01-19           1.74 01.2020
 4    60    1   2020-01-20 2020-01-26           1.74 01.2020
 5    60    3.8 2020-01-27 2020-02-02           1.74 01.2020
 6    61    1.7 2019-12-30 2020-01-05           7.3  01.2020
 7    61   12.9 2020-01-06 2020-01-12           7.3  01.2020
 8    61    1.2 2020-12-29 2021-01-04           1.74 01.2021
 9    61    1.4 2021-01-05 2021-01-11           1.74 01.2021
10    61    1.3 2021-01-12 2021-01-18           1.74 01.2021
11    61    1   2021-01-19 2021-01-25           1.74 01.2021
12    61    3.8 2021-01-26 2021-02-01           1.74 01.2021
13    63    1.7 2020-12-29 2021-01-04           7.3  01.2021
14    63   12.9 2021-01-05 2021-01-11           7.3  01.2021

extended data

df <- structure(list(ID = c(60, 60, 60, 60, 60, 61, 61, 61, 61, 61, 
61, 61, 63, 63), weight = c(1.2, 1.4, 1.3, 1, 3.8, 1.7, 12.9, 
1.2, 1.4, 1.3, 1, 3.8, 1.7, 12.9), start_date = structure(c(18260, 
18267, 18274, 18281, 18288, 18260, 18267, 18625, 18632, 18639, 
18646, 18653, 18625, 18632), class = "Date"), end_date = structure(c(18266, 
18273, 18280, 18287, 18294, 18266, 18273, 18631, 18638, 18645, 
18652, 18659, 18631, 18638), class = "Date")), row.names = c(NA, 
-14L), class = c("tbl_df", "tbl", "data.frame"))
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
  • Hello, thank you very much for answer. I have a question, because I have many years in my dataset and it seems that this solution doesn't takes into consideration only months, not years, so that even if I have weights of 0 for many weeks, I still obtain some monthly_weight >0 , because is taking the weight column for the same months of different year, where the weights are present. How I can add this recognition of a year to this solution? – Fendi Jan 25 '23 at 15:32
  • @Fendi Easy to implement, do you also want to display the additional grouping value, e.g. as column "Year" or within "month", or not at all? – Andre Wildberg Jan 25 '23 at 15:39
  • Ideally it would be if already created column "month" have the month and the year eg. 2020.08, 2021.08, 2022.08 – Fendi Jan 25 '23 at 15:42
  • @Fendi Ah, right, it already has the year in column month, which it takes as additional grouping value. So it should work as is. If still in doubt, you can update the question with an example that includes an additional year and i'll check! – Andre Wildberg Jan 25 '23 at 15:48
  • @Fendi I see, the problem arises when one ID switches from one year to the other. Gonna edit! – Andre Wildberg Jan 25 '23 at 16:06
  • @Fendi I changed the code and added an extended dataset, see if it works for your original data now. – Andre Wildberg Jan 25 '23 at 17:04