Creating column that takes mean of X for specific group id when date1 < date2

Question

in (R) for an event study I'm trying to create a column that calculates the mean of ccu_avg for a specific combination of appid and Eventdate1. One appid has multiple events so it has to be divided by both appid and Eventdate1. The difficult thing here is that I want it to calculate the mean only up until the event date since after the event happened the estimation period stops

The new column should look like est_ccu_avg: picture of the dataset below for explanation

https://i.stack.imgur.com/ZPquW.png

Could someone help me figure the code for this out? I've been trying for hours and can't seem to get it to work.

I've now been trying things like this but without success:

study <- study %>%
  mutate(est_ccu_avg=
                mean(study[unique(study$appid) | study$Eventdate1 > 
study$datefinal, "ccu_avg"])
            
            )

Result of dput head:

structure(list(appid = c("105600", "105600", "105600", "105600", 
"105600", "105600"), name = c("Terraria", "Terraria", "Terraria", 
"Terraria", "Terraria", "Terraria"), ccu_avg = c(26825, 29058, 
37842, 37525, 26484, 24377), ccu_min = c(21176, 21620, 28954, 
32880, 19648, 19118), ccu_max = c(35827, 41322, 50012, 44071, 
33241, 32060), pos_max = c(356186, 356363, 356508, 356712, 356921, 
357092), neg_max = c(6756, 6756, 6758, 6768, 6766, 6768), Maj_Upt = 
c(0, 
0, 0, 0, 0, 0), Min_Upt = c(0, 0, 0, 0, 0, 0), Hotfix = c(0, 
0, 0, 0, 0, 0), Bugfix = c(0, 0, 0, 0, 0, 0), Balance = c(0, 
0, 0, 0, 0, 0), ExpBranch = c(0, 0, 0, 0, 0, 0), Promo = c(0, 
1, 0, 0, 0, 0), Ev_Out = c(0, 0, 0, 0, 0, 0), Ev_In = c(0, 0, 
0, 0, 0, 0), isfree = c(0, 0, 0, 0, 0, 0), developers1 = c("Re- 
Logic", 
"Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic"), 
publishers1 = c("Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic", 
"Re-Logic", "Re-Logic"), metascore = c(83, 83, 83, 83, 83, 
83), singleplayer = c(1, 1, 1, 1, 1, 1), multiplayer = c(1, 
1, 1, 1, 1, 1), coop = c(1, 1, 1, 1, 1, 1), mmo = c(0, 0, 
0, 0, 0, 0), indie = c(1, 1, 1, 1, 1, 1), single_player_gen = c(0, 
0, 0, 0, 0, 0), adventure = c(1, 1, 1, 1, 1, 1), casual = c(0, 
0, 0, 0, 0, 0), strategy = c(0, 0, 0, 0, 0, 0), rpg = c(1, 
1, 1, 1, 1, 1), simulation = c(0, 0, 0, 0, 0, 0), multi_player_gen = 
c(0, 
0, 0, 0, 0, 0), shooter = c(0, 0, 0, 0, 0, 0), platformer = c(0, 
0, 0, 0, 0, 0), ea_min = c(0, 0, 0, 0, 0, 0), ea_max = c(0, 
0, 0, 0, 0, 0), scifi = c(0, 0, 0, 0, 0, 0), sports = c(0, 
0, 0, 0, 0, 0), racing = c(0, 0, 0, 0, 0, 0), inappurchase = c(0, 
0, 0, 0, 0, 0), workshop = c(0, 0, 0, 0, 0, 0), f_release_date = 
c("May 16, 2011", 
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011", 
"May 16, 2011"), l_release_date = c("May 16, 2011", "May 16, 2011", 
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011"
), datefinal = structure(c(18942, 18943, 18944, 18945, 18946, 
18947), class = "Date"), Eventdate = c("", "", "", "", "", 
""), Eventdate1 = structure(c(18949, 18949, 18949, 18949, 
18949, 18949), class = "Date"), est_ccu_avg = c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

Please share your data in code form - for example, by pasting the output of `dput(data)` into your question. We can't import an image into R. Also, can you share the code that you have tried to use to perform this task? — Andrea M, Jun 14 '22 at 10:27
Hi andrea, I'm trying to put the data in code form, since it's a pretty large dataset with 47 variables and 40k rows I can't really copy the output of dput. Is there another way to do this? — Sjoerd S, Jun 14 '22 at 10:37
Ideally you provide a minimal version of your dataset - only the columns and the rows that are necessary to reproduce (and fix) the issue. `dput(head(data))` will return all columns and just for the first 10 rows. You can also subset the columns, or use `data.frame()` to create a sample dataframe from scratch. More info: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Andrea M, Jun 14 '22 at 10:41
I've included the heads of the dataset, they are visible in the question now, thanks! — Sjoerd S, Jun 14 '22 at 10:44

score 0 · Answer 1 · answered Jun 14 '22 at 16:06

I figured it out, there probably is an easier way but this is how I did it:

# first make a list with only the rows where eventdate > datefinal to only 
include estimation period.
estmeans <- study[study$Eventdate1 > study$datefinal,]

# calculate means per appid and eventdate
studymeans <- aggregate(estmeans$ccu_avg, list(estmeans$appid, 
estmeans$Eventdate1), mean)

# change the names for merging 
names(studymeans)[1] <- 'appid'
names(studymeans)[2] <- 'Eventdate1'
names(studymeans)[3] <- 'est_ccu_avg'

# merge the dataframes, it creates 2 new columns, delete the empty one.
studynew <- merge(study, studymeans, by=c("appid", "Eventdate1"))
studynew$est_ccu_avg.x <- NULL

why do you estimate the mean of the values on days AFTER datefinal? In your post, you indicate needing the mean of the values up to the datefinal? — langtang, Jun 14 '22 at 22:00
I needed the mean of the values from datefinal up to Eventdate. so if I'm not mistaken it is correct like this — Sjoerd S, Jun 15 '22 at 07:54

langtang · Answer 2 · 2022-06-15T01:01:02.350

0

You can leverage the special .BY, to refer to the grouping variable, when using data.table

library(data.table)
setDT(df)[, mean(ccu_avg[datefinal<=.BY$Eventdate1]), by=.(appid, Eventdate1)]

The equivalent in dplyr is cur_group().

df %>%
  group_by(appid,Eventdate1) %>%
  summarize(res = mean(ccu_avg[datefinal<=cur_group()$Eventdate1))

edited Jun 15 '22 at 01:01

answered Jun 14 '22 at 22:03

langtang

22,248
1
12
27

Creating column that takes mean of X for specific group id when date1 < date2

2 Answers2