Calculating sums of observation in time intervals in a df

Question

I've posted this as another question, but realised I've got my sample data wrong.

I've got two separate datasets. df1 looks like this:

loc_ID  year  observations
nin212  2002  90
nin212  2003  98
nin212  2004  102
cha670  2001  18
cha670  2002  19
cha670  2003  21

df2 looks like this:

loc_ID  start_year  end_year  
nin212  2002        2003 
nin212  2003        2004
cha670  2001        2002
cha670  2002        2003

I want to calculate the number of observations in the time intervals (start_year to end_year) per loc_ID. In the example above, I would like to achieve this final dataset:

loc_ID  start_year  end_year observations 
nin212  2002        2003     188
nin212  2003        2004     200 
cha670  2001        2002     37
cha670  2002        2003     40

How could I do this?

score 1 · Accepted Answer · answered Mar 01 '21 at 17:40

We can do a non-equi join

library(data.table)
setDT(df2)[, observations := setDT(df1)[df2, sum(observations),
  on = .(loc_ID, year >= start_year, year <= end_year),
     by = .EACHI]$V1]

-output

df2
#   loc_ID start_year end_year observations
#1: nin212       2002     2003          188
#2: nin212       2003     2004          200
#3: cha670       2001     2002           37
#4: cha670       2002     2003           40

data

structure(list(loc_ID = c("nin212", "nin212", "nin212", "cha670", 
"cha670", "cha670"), year = c(2002L, 2003L, 2004L, 2001L, 2002L, 
2003L), observations = c(90L, 98L, 102L, 18L, 19L, 21L)), 
class = "data.frame", row.names = c(NA, 
-6L))
> dput(df2)
structure(list(loc_ID = c("nin212", "nin212", "cha670", "cha670"
), start_year = c(2002L, 2003L, 2001L, 2002L), end_year = c(2003L, 
2004L, 2002L, 2003L)), class = "data.frame", row.names = c(NA, 
-4L))

Brilliant :) Thank you for help again @akrun! I just want to ask, what is the `by = .EACHI` doing? Also why `$V1` ? — fifigoblin, Mar 01 '21 at 17:52
@fifigoblin It is because I am doing a join from 'df1' with 'df2', so the column by default if it is not named will start from V1 (`sum(observations)`). That column is extracted `$V1`), also, if we don't specify `by`, it would not do the sum by each group — akrun, Mar 01 '21 at 17:54

Calculating sums of observation in time intervals in a df

1 Answers1

data