3

I would like to create a column equal to a running sum of data$Rating, given two conditions are true in column 3 and 4, specifically that data$Year< current year and data$ID is equal to current ID.

In words this should calculate the cumulative sum of ratings for each id up until the previous year. And it should do this for each row in the data frame (about 50,000 rows). Given the size of the data frame, I'd prefer not to loop, if this at all possible.

I've provided a short example of how this would look below...

> head(data[,c(3,4,13)])
  Year        ID        Rating  CumSum
1 2010        13578     2       0
2 2010        13579     1       0
3 2010        13575     3       0
4 2011        13575     4       3
5 2012        13578     3       2    
6 2012        13579     2       1
7 2012        13579     4       1

I'm coming from a spreadsheet background, so am still thinking in terms of SUMIFS, etc. (which would nicely solve my problem in Excel), so apologies if the language isn't precise.

user3725021
  • 566
  • 3
  • 14
  • 32

1 Answers1

1
data <- data.frame(Year = c( rep(2010, 3), 2011, rep(2012, 3) ),
                   ID = c(13578, 13579, 13575, 13575, 13578, 13579, 13579),
                   Rating = c(2, 1, 3, 4, 3, 2, 4))
data 

#   Year        ID        Rating  
# 1 2010        13578     2       
# 2 2010        13579     1       
# 3 2010        13575     3       
# 4 2011        13575     4       
# 5 2012        13578     3           
# 6 2012        13579     2       
# 7 2012        13579     4       
  • Create a column equal to a running sum of data$Rating where
    • data$Year < Year
    • data$ID == ID
  • This should calculate the cumulative sum of ratings for each ID up until the previous year

The desired output would be

data 

#   Year  ID     Rating    CumSum
# 1 2010  13578  2         2        
# 2 2010  13579  1         1
# 3 2010  13575  3         3
# 4 2011  13575  4         7
# 5 2012  13578  3         5  
# 6 2012  13579  2         3
# 7 2012  13579  4         7

This could be done like so

year <- 2014 # maximum year to include in cumsum
ID.values <- names(table(data$ID)) # get unique values of data$ID, sorted

# cumsum for 13575 rows, followed by cumsum for 13578 rows, ...
Rating.cumsum <- unlist(sapply(ID.values, 
                     function(x) cumsum(data$Rating[data$ID == x]))) 

# assign cumsum output to appropriate rows
data$cumsum[with(data, order(ID))] <- Rating.cumsum
nathanesau
  • 1,681
  • 16
  • 27