I have this dataset on an R/SQL Server:
name year
1 john 2010
2 john 2011
3 john 2013
4 jack 2015
5 jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
I am trying to add two columns that:
- Column 1: Looks at the "number of missing years between successive rows" for each person.
- Column 2: Sum the cumulative "number of missing years" for each person
For example - the first instance of each person will be 0, and then:
# note: in this specific example that I have created, "missing_ years" is the same as the "cumulative_missing_years"
name year missing_years cumulative_missing_years
1 john 2010 0 0
2 john 2011 0 0
3 john 2013 1 1
4 jack 2015 0 0
5 jack 2018 3 3
6 henry 2010 0 0
7 henry 2011 0 0
8 henry 2012 0 0
I think this can be done with a "grouped cumulative difference" and "grouped cumulative sums":
library(dplyr)
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final = my_data %>%
group_by(name) %>%
arrange(year) %>%
mutate(missing_year) = year- lag(year, default = first(year)) %>%
mutate(cumulative_missing_years) = mutate( cumulative_missing_years = cumsum(cs))
But I am not sure if I am doing this correctly.
Ideally, I am looking for an SQL approach or an R approach (e.g. via DBPLYR) that can be used to interact with the dataset.
Can someone please suggest an approach for doing this?
Thank you!