0

Suppose I have the next dataframe. How can I create a new "avg" column that is the result of averaging the last 2 dates ("date") for each group. The idea is to apply this to a dataset with hundreds of thousands of files, so performance is important. The function should contemplate a variable number of months (example 2 or 3 months) and be able to change between simple and medium average.

Thanks in advance.

table1<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20))


  group   date price
1     1 201903    10
2     1 201902    30
3     1 201901    50
4     1 201812    20
5     2 201903     2
6     2 201902    10
7     2 201901     9
8     2 201812    20


result<-data.frame(group=c(1,1,1,1,2,2,2,2),date=c(201903,201902,201901,201812,201903,201902,201901,201812),price=c(10,30,50,20,2,10,9,20), avg = c(20, 40, 35, NA, 6, 9.5, 14.5, NA))


  group   date price  avg
1     1 201903    10 20.0
2     1 201902    30 40.0
3     1 201901    50 35.0
4     1 201812    20   NA
5     2 201903     2  6.0
6     2 201902    10  9.5
7     2 201901     9 14.5
8     2 201812    20   NA
Raymont
  • 283
  • 3
  • 16

2 Answers2

1

If your date column is sorted, then hers's a way to do it using data.table:

library(data.table)
setDT(table1)[, next_price := dplyr::lead(price), by = group][, total_price := price + next_price][, avg := total_price / 2][, c("total_price", "next_price") := NULL]

table1

   group   date price  avg
1:     1 201903    10 20.0
2:     1 201902    30 40.0
3:     1 201901    50 35.0
4:     1 201812    20   NA
5:     2 201903     2  6.0
6:     2 201902    10  9.5
7:     2 201901     9 14.5
8:     2 201812    20   NA
sm925
  • 2,648
  • 1
  • 16
  • 28
  • And what if I wanted the time band to be dynamic? Example: calculating the average of the last 2 or 3 months – Raymont Nov 22 '19 at 16:04
  • You should be able to do that using `lead` function's `n` argument. In case of 3 months, you'll have to create an additional column, sum all of them up and take an average. – sm925 Nov 22 '19 at 16:21
1

sort the data.frame first so that date is ascending for each group

table1 <- table1[order(table1$group, table1$date), ]

create a moving average function with argument for number of months. other function options available from: Calculating moving average

mov_avg <- function(y, months = 2){as.numeric(filter(y, rep(1 / months, months), sides = 1))}

Use the classic do.call-lapply-split combo with this mov_avg function

table1$avg_2months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=2))
table1$avg_3months <- do.call(c, lapply(split(x=table1$price, f=table1$group), mov_avg, months=3))

table1

  group   date price avg_2months avg_3months
4     1 201812    20          NA          NA
3     1 201901    50        35.0          NA
2     1 201902    30        40.0    33.33333
1     1 201903    10        20.0    30.00000
8     2 201812    20          NA          NA
7     2 201901     9        14.5          NA
6     2 201902    10         9.5    13.00000
5     2 201903     2         6.0     7.00000
ThetaFC
  • 660
  • 3
  • 9
  • Strange, but I get the following. Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')" – Raymont Nov 22 '19 at 17:37
  • That is annoying. Restart your R session and don't use the ```dplyr``` library right now. ```filter_``` is a ```dplyr``` function and is "masking" the ```stats``` library ```filter``` function. – ThetaFC Nov 22 '19 at 18:02
  • Or alter the ```mov_avg``` function and replace ```filter``` with ```stats::filter``` to be explicit about which library you want this function from – ThetaFC Nov 22 '19 at 18:06