-3

I would like to calculate the mean value of the ratio [op/(tr - lag(tr))] for each year for the following data

structure(list(gvkey = c(1000L, 1000L, 1000L, 1000L, 1000L, 1001L, 
1001L, 1001L, 1003L, 1003L, 1003L, 1003L, 1003L, 1003L, 1003L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1005L, 1005L, 1005L, 1005L, 1005L, 1005L, 
1005L, 1005L, 1006L, 1006L, 1006L, 1006L, 1006L, 1006L, 1006L, 
1006L, 1006L, 1007L, 1007L, 1007L, 1007L, 1007L, 1007L, 1007L, 
1007L, 1007L, 1007L), fyear = c(1973L, 1974L, 1975L, 1976L, 1977L, 
1983L, 1984L, 1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 
1989L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 
1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 
2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 
2017L, 2018L, 2019L, 2020L, 1974L, 1975L, 1976L, 1977L, 1978L, 
1979L, 1980L, 1981L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 
1980L, 1981L, 1982L, 1974L, 1975L, 1977L, 1978L, 1979L, 1980L, 
1981L, 1982L, 1983L, 1984L), ppent = c(7.408, 5.691, 7.263, 16.44, 
19.917, 8.536, 12.715, 22.411, 0.256, 0.424, 0.893, 1.036, 1.179, 
1.076, 0.654, 4.258, 4.746, 5.652, 6.016, 8.272, 11.142, 13.548, 
14.405, 32.531, 31.946, 33.648, 34.466, 38.614, 44.342, 46.938, 
50.77, 63.441, 63.415, 60.422, 56.052, 54.783, 56.596, 54.831, 
71.108, 82.905, 104.012, 110.003, 108.907, 102.591, 166.761, 
166.137, 139.137, 213.38, 260.167, 310.393, 245.586, 334.43, 
417.764, 456.015, 426.4, 413.3, 295, 313.9, 361.5, 316.6, 348.8, 
437.1, 380.1, 1.159, 1.269, 1.348, 1.498, 2.292, 2.793, 4.908, 
6.662, 0.281, 0.22, 0.238, 0.379, 0.385, 0.362, 0.322, 0.281, 
0.381, 0.367, 0.315, 0.258, 0.238, 0.203, 0.178, 0.17, 0.095, 
0.064, 0.286), aqc = c(0, 1.652, 0, 5.443, 5.447, 0.238, 0.034, 
17.931, 0, 0, 2.875, 0, 0, 0, 0, 0.653, 0, 0, 0, 2.077, 0, 0, 
0.629, 18.727, 1.713, 1.152, 0, 0, 0, 2.749, 2.397, 0, 0, 0, 
0, 0, 0, 0, 0, 28.148, 15.175, 0, 3.2, 13.251, 0, 0, 0, 0, 38.478, 
85.21, 0, 193.989, 0, 298.087, 21.3, 15.3, 1, 0, 12.5, 22.9, 
2.3, 0, 0, 0, 0, 0, 0, 0, 0.133, 0.107, 0.142, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
90L), class = "data.frame")

I would like an answer using dplyr

I apply the following code as suggested but it gives me wrong values as shown below. I have tried to calculate the median and then it gives me 0 for all years.

medianaqcproportion <- cleanData %>%
  group_by(gvkey) %>%
  mutate(lagppent = lag(ppent)) %>%
  group_by(fyear) %>%
  summarize(ans = mean(aqc / (ppent - lagppent), na.rm = TRUE), .groups = "drop")

enter image description here

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • What did you try? Where did you stuck? – maydin Jan 19 '23 at 13:23
  • This is almost certainly a duplicate of [summarize by group](https://stackoverflow.com/q/11562656/3358272). – r2evans Jan 19 '23 at 13:26
  • I have gone as far as to add another column with the ratio. But then I do not know how to tell R to get the mean value for each year... – Ayoze Alfageme Jan 19 '23 at 13:26
  • 1
    what is `tr(-1)`, is it `1/tr`? IOW, does your ratio reduce to `(op - 1)/tr`? – r2evans Jan 19 '23 at 13:27
  • the ratio is op over the diferrence between tr the same period and tr of the previous period – Ayoze Alfageme Jan 19 '23 at 13:28
  • A "lag" function? Okay, please [edit] your question for two things: be explicit about `tr(-1)`, it is not intuitive nor assumed; and you need to get your order of operators correct, since to me this means divide `op` by `tr`, and then subtract from that ratio `tr(-1)`. I suspect you need parens around the `diff(tr)`. Next question: what value should `tr - lag(tr)` have for the first year? The default will be `NA`, are you good with that? – r2evans Jan 19 '23 at 13:32
  • Thank you, @r2evans! Edited. Both NA or omitting the first year ratio are ok. – Ayoze Alfageme Jan 19 '23 at 13:36
  • Does this answer your question? [Calculate the mean by group](https://stackoverflow.com/questions/11562656/calculate-the-mean-by-group) – NelsonGon Jan 19 '23 at 13:37
  • @AyozeAlfageme, this really is a dupe of the "calc mean by group", albeit with just a little pre-calc. If my answer works for you, even though the question has been closed, you can still accept it if you like, or comment on it if something is not clear or not quite right. – r2evans Jan 19 '23 at 13:40

1 Answers1

1

Interpretation:

  • tr(-1) is a lag, ordered by year, grouped by id
  • [op/tr - tr(-1)] is actually op/(tr - lag(tr)) (grouping and naming 'lag')

This may be broken, but it's a first attempt:

library(dplyr)
quux %>%
  group_by(id) %>%
  mutate(ans = op / (tr - lag(tr))) %>%
  group_by(year) %>%
  summarize(ans = mean(ans, na.rm = TRUE), .groups = "drop")
# # A tibble: 5 × 2
#   year     ans
#   <chr>  <dbl>
# 1 1984  NaN   
# 2 1985    2   
# 3 1986    7.5 
# 4 1987    8   
# 5 1988    8.67

I'm inferring that the lag(tr) you reference should be grouped by id, so I group and calculate that lag first. After that, I group by year and summarize on the ratio.


Data

quux <- structure(list(year = c("1984", "1985", "1986", "1987", "1988", "1985", "1986", "1987", "1988", "1985", "1986", "1986", "1987", "1988"), id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4), op = c(10, 20, 30, 40, 50, 15, 17, 18, 19, 20, 22, 10, 20, 40), tr = c(10, 20, 30, 40, 50, 15, 17, 18, 19, 20, 22, 10, 20, 40)), class = "data.frame", row.names = c(NA, -14L))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Thanks! that actually works also for my much larger database. – Ayoze Alfageme Jan 19 '23 at 13:46
  • Also, I could skip the mutate part and calculate the ratio as ```mean(op / (tr - lag(tr))```. It does not change much, it perhaps simplyfies it a bit – Ayoze Alfageme Jan 19 '23 at 13:50
  • _That is (almost certainly) wrong_. If you do `lag(tr)` while in a `group_by(year)`, then your lag will be across different `id`s. Granted, I'm assuming that you want previous year by id. I suggest before you go any further that you **manually** calculate some of the values you expect and compare with my (double-`group_by`) and your (single-`group_by`) methods to see which is correct. – r2evans Jan 19 '23 at 13:58
  • you are of course right. I have re-edited the question since I have found further issues. Sorry for my messy way of asking. Not an expert here... and many, mnay thanks for your help @r2evans – Ayoze Alfageme Jan 19 '23 at 14:27
  • 1973 is `NaN` because there are insufficient observations of that year; 1975 (and others) are `Inf` because the ratio is `0`. I've edited the answer slightly (the `lag(.)` is still per-`id`), I suggest you look at code such as `cleanData %>% group_by(gvkey) %>% mutate(ans = aqc / (ppent - lag(ppent))) %>% filter(fyear==1975)` (similar for 1973, etc) to see why your summarized values are what they end up being. – r2evans Jan 19 '23 at 14:51
  • thanks, I'll do so and let you know next. – Ayoze Alfageme Jan 19 '23 at 15:03