0

I am trying to create a column in a data.frame or data.table with two conditions. The difference to the posts I have seen and which I have tried to modify below is that I do not have 'value' for the conditions but the conditions depend on other variables in the data.frame.

Let's assume this is my data frame:

mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
                             2007, 2000, 2001, 2002, 2003,
                             2003, 2004, 2005, 2006, 2006, 2007),
                    Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
                             "Fred", "Gill", "Gill", "Tom", "Tom",
                             "Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))

I want to find out how many times the 3 subjects have experienced an event in the last 5 years. However, if the event dates go back more than 5 years, I do not want to include it. I thought I could do a sum of an indicator variable (set to 1 if the subject experienced the event in the year) while specifying something along the lines of Year < Year & Year >= Year-5. So basically sum the experiences for the year smaller than the focal year and larger than or equal to 5 years before the focal year.

I have create an indicator for summing and a variable for focal year - 5

mydf$Ind <- 1
mydf$Yearm5 <- mydf$Year-5

Then I convert to data table for speed (the original df has +60k obs)

library(data.table)
mydf <- data.table(mydf)

The issue now is that I cannot get the two conditions to work. The post I have seen seem to all know a specific value by which to subset (e.g. R data.table subsetting on multiple conditions.), but in my case the value changes from observation to observation (not sure if this means I need to do some looping?).

I thought I need something along the lines of:

mydf[, c("Exp"):= sum(Ind), by = c("Name")][Year < Year & Year >= Yearm5]

gives:

Empty data.table (0 rows) of 5 cols: Year,Name,Ind,Yearm5,Exp

Using just one condition

mydf1 <- mydf[, c("Exp"):= sum(Ind), by = c("Name")][Year >= Yearm5] 

gives the total experience so I am assuming that something is wrong with the Year < Year condition.

I am not quite sure what though. I have also tried to modify the suggestions in: how to cumulatively add values in one vector in R with not luck again something seems to be wrong with the way I specify the conditions.

library(dplyr)
mytest1 <- mydf %>%
           group_by(Name, Year) %>%
           filter(Year < Year & Year >= Yearm5) %>%
           mutate(Exp = sum(Ind))

The result should look as follows:

myresult <- data.frame (Year = c(2003, 2004, 2004, 2006,
                                 2007, 2000, 2001, 2005,
                                 2005, 2006, 2007, 2000,
                                 2001, 2002, 2002, 2003),
                        Name = c("Fred", "Fred", "Fred", "Fred",
                                 "Fred", "Gill", "Gill", "Gill",
                                 "Gill", "Gill", "Gill", "Tom",
                                 "Tom", "Tom", "Tom", "Tom"),
                        Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
                        Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
                        Yearm5 = c(1998, 1999, 1999, 2001, 2002,
                                   1995, 1996, 2000, 2000, 2001,
                                   2002, 1995, 1996, 1996, 1997, 1998))

Any help or pointers would be appreciated!

Community
  • 1
  • 1
Rkook
  • 63
  • 1
  • 11

3 Answers3

3

Here's a more data.table approach, using roll.

setDT(mydf)

# this is our desired end point
boundary = mydf[, list(Name, year.end = Year + 4)]

# set the key for the following merges
setkey(mydf, Name, Year)
setkey(boundary, Name, year.end)

# add indices that will keep track of the positions to compute deltas
mydf[, idx := .I]
boundary[, idx := .I]

# merge, rolling to match the end correctly, and then subtract the indices
# to get the desired delta.
# Note that we need to unique data because of duplicates.
# Depending on data you may also need to add `allow.cartesian = TRUE`.
# Final note - in data.table <= 1.9.2 you should omit the `by = .EACHI` part.

mydf[unique(boundary)[unique(mydf), list(Exp = i.idx - idx),
                      roll = -Inf, by = .EACHI]]
#    Year Name idx Exp
# 1: 2003 Fred   1   0
# 2: 2004 Fred   2   1
# 3: 2004 Fred   3   1
# 4: 2006 Fred   4   3
# 5: 2007 Fred   5   4
# 6: 2000 Gill   6   0
# 7: 2001 Gill   7   1
# 8: 2005 Gill   8   1
# 9: 2005 Gill   9   1
#10: 2006 Gill  10   2
#11: 2007 Gill  11   3
#12: 2000  Tom  12   0
#13: 2001  Tom  13   1
#14: 2002  Tom  14   2
#15: 2002  Tom  15   2
#16: 2003  Tom  16   4
eddi
  • 49,088
  • 6
  • 104
  • 155
  • Brilliant. I thought there would be a way using `roll`, although I couldn't find it. – mnel Aug 07 '14 at 23:16
  • Thanks a lot for this approach, works perfectly on my data as well. I am not quite sure I understand how `roll` works and I could not find the relevant documentation. In the spirit of learning would it be possible to get a link to some documentation so I can apply it in the future? – Rkook Aug 09 '14 at 01:05
  • Sorry, I was premature in posting this I just found it at http://cran.r-project.org/web/packages/data.table/data.table.pdf on the bottom of page 5. – Rkook Aug 09 '14 at 01:12
2

With data.table, I think the syntax you are looking for should be like this:

setDT(mydf)
mydf[ , Exp := rank(x=Year,ties.method="min")-1, by=Name]
Mike.Gahan
  • 4,565
  • 23
  • 39
  • Thanks a lot for the quick response and that suggestion. I had not though of using the rank function before. However, it does not give the desired result (I get the same result using `cumsum`). Note that "Gill" should have `Exp` values of 3 and 4 for 2006 and 2007 respectively because 2000 and 2001 are more than 5 years before 2006 and 2007 so the experience should not be included in the sum. Looking at `mydf <- mydf[with(mydf, order(Name,Year)),]` makes it clearer. – Rkook Aug 07 '14 at 03:50
  • @ Mike.Gahan: My comment above is incorrect, the values for `Exp` should be 2 and 3 for 2006 and 2007 respectively, even basic maths does not seem to work anymore :(. – Rkook Aug 07 '14 at 05:42
2

Here is an approach using rollapply and data.table

library(zoo)
 setDT(mydf)
 setkey(mydf, Name,Year)
 # create a data.table that has all Years and incidences including the 5 year window 
 # and sum up the number of incidences per year for each subject 
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
            list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
            keyby=list(Name,Year)]
# use rollapply over this larger data.table to get the number of
# incidences in the previous 5 years (not including this year (hence head(x,-1))
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)), 
                    align = 'right', fill=0),by=Name]
# join with the original to create your required data
m[mydf, !'I2']
   Name Year Ind Exp
#  1: Fred 2003   1   0
#  2: Fred 2004   1   1
#  3: Fred 2004   1   1
#  4: Fred 2006   1   3
#  5: Fred 2007   1   4
#  6: Gill 2000   1   0
#  7: Gill 2001   1   1
#  8: Gill 2005   1   1
#  9: Gill 2005   1   1
# 10: Gill 2006   1   2
# 11: Gill 2007   1   3
# 12:  Tom 2000   1   0
# 13:  Tom 2001   1   1
# 14:  Tom 2002   1   2
# 15:  Tom 2002   1   2
# 16:  Tom 2003   1   4
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
mnel
  • 113,303
  • 27
  • 265
  • 254
  • Thanks a lot for that, works a charm and it quite fast too. I would not have come up with this idea by myself. I guess just as a clarification for other users `require(zoo)` needs to be called before running the code. – Rkook Aug 07 '14 at 12:35
  • @Rkook -- indeed and fixed. \ – mnel Aug 07 '14 at 23:15