0

I have this dataset on an R/SQL Server:

   name year
1  john 2010
2  john 2011
3  john 2013
4  jack 2015
5  jack 2018
6 henry 2010
7 henry 2011
8 henry 2012

I am trying to add two columns that:

  • Column 1: Looks at the "number of missing years between successive rows" for each person.
  • Column 2: Sum the cumulative "number of missing years" for each person

For example - the first instance of each person will be 0, and then:

# note: in this specific example that I have created, "missing_ years" is the same as the "cumulative_missing_years"

   name year missing_years cumulative_missing_years
1  john 2010             0                        0
2  john 2011             0                        0
3  john 2013             1                        1
4  jack 2015             0                        0
5  jack 2018             3                        3
6 henry 2010             0                        0
7 henry 2011             0                        0
8 henry 2012             0                        0

I think this can be done with a "grouped cumulative difference" and "grouped cumulative sums":

 library(dplyr)
 library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# https://stackoverflow.com/questions/30606360/subtract-value-from-previous-row-by-group
final = my_data %>%
    group_by(name) %>%
    arrange(year) %>%
    mutate(missing_year) = year- lag(year, default = first(year))  %>%
 mutate(cumulative_missing_years) = mutate( cumulative_missing_years = cumsum(cs))

But I am not sure if I am doing this correctly.

Ideally, I am looking for an SQL approach or an R approach (e.g. via DBPLYR) that can be used to interact with the dataset.

Can someone please suggest an approach for doing this?

Thank you!

stats_noob
  • 5,401
  • 4
  • 27
  • 83

2 Answers2

2

Using the data in the Note at the end perform a left self join to get the next year of the same name and then subtract and take the cumulative sum.

library(sqldf)

sqldf("select a.*, 
    coalesce(min(b.year) - a.year - 1, 0) as missing,
    sum(coalesce(min(b.year) - a.year - 1, 0)) over 
      (partition by a.name order by a.year) as sum
  from DF a 
  left join DF b on a.name = b.name and a.year < b.year
  group by a.name, a.year
  order by a.name, a.year")

giving:

   name year missing sum
1 henry 2010       0   0
2 henry 2011       0   0
3 henry 2012       0   0
4  jack 2015       2   2
5  jack 2018       0   2
6  john 2010       0   0
7  john 2011       1   1
8  john 2013       0   1

Note

Lines <- "name year
1  john 2010
2  john 2011
3  john 2013
4  jack 2015
5  jack 2018
6 henry 2010
7 henry 2011
8 henry 2012
"
DF <- read.table(text = Lines)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • @ G. Grothendieck : Thank you so much! Can this code also work outside the sqldf library? E.g. with the DBI library, or an SQL server itself? – stats_noob Oct 26 '22 at 21:35
  • There can be differences between databases but I tried to keep it as generic as possible so it would likely work with other databases but you will have to try it to really know. There is nothing specific to sqldf in the SQL statement. – G. Grothendieck Oct 26 '22 at 21:44
1

I hope this helps

name <- c(rep("John", 3), rep("jack", 2),  rep("henry", 3) )

year <- c(2010, 2011, 2013, 2015, 2018, 2010, 2011, 2012)


dt <- data.frame(name = name, year = year)

# first group the data by name then order by year then mutate
dt <- dt %>% 
  group_by(name) %>% 
  arrange(year, .by_group = TRUE) %>% 
  mutate( mis_yr = if_else(is.na(year - lag(year, n = 1L) -1), 0, 
                           year - lag(year, n = 1L) -1)   ,
          cum_yr = cumsum(mis_yr)
          ) %>% 
  ungroup()

Hare is the outcome

name   year mis_yr cum_yr
  <chr> <dbl>  <dbl>  <dbl>
1 henry  2010      0      0
2 henry  2011      0      0
3 henry  2012      0      0
4 jack   2015      0      0
5 jack   2018      2      2
6 John   2010      0      0
7 John   2011      0      0
8 John   2013      1      1
Ahad Zaman
  • 321
  • 1
  • 9