0

I'm fairly new to R and have gone through some tutorials. What I'd like to do is find a good method of joining data onto itself based on some conditions.

In this case what I want to do is pick an arbitrary length of lag and create a rolling window. For example, if the lag = 1 and the window width = 2, I want to roll up the 2 months that are 1 month previous for each month, if they exist.

If I start with a data table like this:

mytable = data.table(Month = c(6, 5, 4, 6, 5), Year = c(2016, 2016, 2016, 2016, 2016), Company = c('Kellog', 'Kellog', 'General Mills', 'General Mills', 'General Mills'), ProducedCereals = c(6, 3, 12, 5, 7), CommercialsShown = c(12, 15, 4, 20, 19))

Month Year Company   ProducedCereals CommercialsShown
  6   2016  Kellog         6              12
  5   2016  Kellog         3              15
  4   2016  Kellog        12               4
  6   2016  General Mills  5              20
  5   2016  General Mills  7              19

The table with the calculated fields might look like this:

Month Year Company   ProducedCereals CommercialsShown
  6   2016  Kellog        15              19
  5   2016  Kellog        12               4
  4   2016  Kellog        NA              NA
  6   2016  General Mills  7              19
  5   2016  General Mills NA              NA

I've tried rollapply() with a width of a list, but it seems to be contingent on the data being a regular time series. However, mine isn't. It needs to be grouped by Company, and some rows might be missing. It further needs to take the previous n rows based on the Month and Year fields.

I realize a workaround might be to render the data so the operation is performed for each Company subset and inject dummy data for months missing in the middle, but I was thinking a better way probably exists.

I tried the following approach, which applies a lag and rolling window, but without respect to the month, year, and company.

newthing <- lapply(mytable[,c('ProducedCereals'),with=F], function(x) rollapply(x, width=list(2:3),sum,align='left',fill=NA))
shiggity
  • 531
  • 4
  • 12
  • 1
    See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 re how to make a good reproducible example (that can be copy-pasted into a new R session and run). – Frank Aug 07 '17 at 15:41
  • 1
    Well I have expected output and what I feel is a relatively succinct explanation of the problem, so you want me to add input data in R? You got it. Edited – shiggity Aug 07 '17 at 15:59
  • 1
    Ok thanks. Dunno how quickly someone will post an answer, but in the meantime: I guess a non equi join should work https://stackoverflow.com/questions/44406040/sum-over-past-window-size-dates-per-group/44407291#44407291 assuming you use a yearmon variable instead of two separate columns. – Frank Aug 07 '17 at 16:09
  • 1
    Great! I don't expect someone else to do the work for me, I just needed some advice on where to look. I wasn't aware of non-equi joins but that looks promising. I also am editing to provide the approach I tried, which rolls up the data but not by month, year, and company. – shiggity Aug 07 '17 at 16:12

3 Answers3

2

1) Using the data defined in the Note at the end use rollapply as shown below. nms is the names of the columns to perform the rolling window calculation over or it could be specified as just the column indexes (i.e. nms <- 4:5). Sum is like sum except that it will return NA, instead of 0, if given a series which is entirely NA and otherwise it performs sum(X, na.rm = TRUE). Note that the NA values added in roll are so that the series is not shorter than the window width.

library(data.table)
library(zoo)

k <- 2 # prior two months

Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
roll <- function(x) rollapply(c(x, rep(NA, k)), list(1:k), Sum)
nms <- names(mytable)[4:5]

mytable[, (nms) := lapply(.SD, roll), .SDcols = nms, by = "Company"]

giving:

> mytable
   Month Year       Company ProducedCereals CommercialsShown
1:     6 2016        Kellog              15               19
2:     5 2016        Kellog              12                4
3:     4 2016        Kellog              NA               NA
4:     6 2016 General Mills               7               19
5:     5 2016 General Mills              NA               NA

1a) In a comment the situation is mentiond where there are missing rows and only the most recent two calendar months prior to the current row are to be used so fewer than 2 rows might be used in any sum.

It will be convenient in this case to sort the data frame first in order of Company and then date in ascending order which implies that we want right alignment rather than left in rollapply.

We pass a zoo object with yearmon index to rollapply so that we have a time index that Sum can check to subset the input to the desired window. We use a window size of 3 and only sum the values in the window whose times lie within specified bounds. We will specify coredata = FALSE to rollapply in order that the data and index be passed to the rollapply function and not just the data.

k <- 2 # prior 2 months

# inputs zoo object x, subsets it to specified window and sums
Sum2 <- function(x) {
  w <- window(x, start = end(x) - k/12, end = end(x) - 1/12)
  if (length(w) == 0 || all(is.na(w))) NA_real_ else sum(w, na.rm = TRUE)
}

nms <- names(mytable)[4:5]

setkey(mytable, Company, Year, Month) # sort

# create zoo object from arguments and run rollapplyr using Sum2
roll2 <- function(x, year, month) {
  z <- zoo(x, as.yearmon(year + (month - 1)/12))
  coredata(rollapplyr(z, k+1, Sum2, coredata = FALSE, partial = TRUE))
}

mytable[, (nms) := lapply(.SD, roll2, Year, Month), .SDcols = nms, by = "Company"]

giving:

> mytable
    Month Year       Company ProducedCereals CommercialsShown
1:     5 2016 General Mills              NA               NA
2:     6 2016 General Mills               7               19
3:     4 2016        Kellog              NA               NA
4:     5 2016        Kellog              12                4
5:     6 2016        Kellog              15              

1b) Another approach to missing rows is to conver the data to long form and then to a rectangular form filling in missing cells with NA. That will work as long as the same month and year is not missing in every company.

k <- 2 # sum over k prior months
m <- melt(mytable, id = 1:3)
dd <- as.data.frame.table(tapply(m$value, m[, 1:4, with = FALSE], c), 
    responseName = "value")
Sum1 <- function(x) {
   x <- head(x, -1)
   if (length(x) == 0 || all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}
setDT(dd)[, value := rollapplyr(value, k+1, Sum1, partial = TRUE), 
     by = .(Company, variable)]
dc <- as.data.table(dcast(... ~ variable, data = dd, value = "value"))
setkey(dc, Company, Year, Month)
dc

giving:

   Month Year       Company ProducedCereals CommercialsShown
1:     4 2016 General Mills              NA               NA
2:     5 2016 General Mills              NA               NA
3:     6 2016 General Mills               7               19
4:     4 2016        Kellog              NA               NA
5:     5 2016        Kellog              12                4
6:     6 2016        Kellog              15               19

2) Another possibility is to convert mytable to the zoo object z splitting mytable by Company and then use rollapply on that. mytable is again as shown in the Note at the end. Sum is from (1).

k <- 2 # prior 2 months

ym <- function(m, y) as.yearmon(paste(m, y), format = "%m %Y")
z <- read.zoo(mytable, index = 1:2, split = k+1, FUN = ym)

Sum <- function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
rollapply(z, list(-1:-k), Sum, partial = TRUE, fill = NA) 

giving:

         ProducedCereals.General Mills CommercialsShown.General Mills
Apr 2016                            NA                             NA
May 2016                            NA                             NA
Jun 2016                             7                             19
         ProducedCereals.Kellog CommercialsShown.Kellog
Apr 2016                     NA                      NA
May 2016                     12                       4
Jun 2016                     15                      19

Note: The code in the question does not generate the data displayed in the question so we used this instead for the data.table mytable:

library(data.table)
mytable <-
structure(list(Month = c(6, 5, 4, 6, 5), Year = c(2016, 2016, 
2016, 2016, 2016), Company = c("Kellog", "Kellog", "Kellog", 
"General Mills", "General Mills"), ProducedCereals = c(6, 3, 
12, 5, 7), CommercialsShown = c(12, 15, 4, 20, 19)), .Names = c("Month", 
"Year", "Company", "ProducedCereals", "CommercialsShown"), row.names = c(NA, 
-5L), class = "data.frame")
mytable <- as.data.table(mytable)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thanks, this is much different than what I came up with. I noticed a coercion error when I used different values in the list() in solution #1, e.g. list(2:3). When I changed fill = NA to a numeric value such as fill = -1 it solved the problem, even though in your example NA is in the output as expected. I don't understand why that would be the case. – shiggity Aug 07 '17 at 20:02
  • I also noticed that solution #1 is contingent on the data being a regular time series which is good to note – shiggity Aug 07 '17 at 20:09
  • Well... I did write "...so the operation is performed for each Company subset and inject dummy data for **months missing in the middle**" but I suppose I could have written it more explicitly – shiggity Aug 08 '17 at 14:30
  • Sorry it wasn't clear to me that this was part of the question as the sample data had no missing rows. At any rate 1a and 1b in my answer do address this anyways. – G. Grothendieck Aug 08 '17 at 14:35
1

I tried a non-equi join--it didn't like a join with itself so I copied the table. Although I'm sure this isn't the best way, it does handle missing months.

lag = 2 # The lag in number of months
block = 3 # The number of contiguous months to roll up

mytable = data.table(Month = c(6, 5, 4, 6, 5), Year = c(2016, 2016, 2016, 2016, 2016), Company = c('Kellog', 'Kellog', 'General Mills', 'General Mills', 'General Mills'), ProducedCereals = c(6, 3, 12, 5, 7), CommercialsShown = c(12, 15, 4, 20, 19))

setDT(mytable)[, "MonthsSinceEpoch" := {
  MonthsSinceEpoch = (Year - 2000) * 12 + Month
 .(MonthsSinceEpoch)
}]

mytable2 <- mytable

setDT(mytable2)[, "EndMonths" := {
  EndMonths = MonthsSinceEpoch - lag
  .(EndMonths)
}]
setDT(mytable2)[, "StartMonths" := {
  StartMonths = MonthsSinceEpoch - lag - block + 1
  .(StartMonths)
}]

mytable3 <- mytable[mytable2, on = .(Company, MonthsSinceEpoch >= StartMonths, MonthsSinceEpoch <= EndMonths),
                   .(CommercialsShown = sum(CommercialsShown), ProducedCereals = sum(ProducedCereals)),
                   by=.EACHI]

mytable3 <- mytable3[order(rank(Company), -MonthsSinceEpoch)]
mytable3
shiggity
  • 531
  • 4
  • 12
0

To perform this procedure on a data.table, you must use the data.table package and the frollapply function, as informed below.

dt[, x.value.sum := frollapply(x = x, n = 2, sum, fill = NA, align = "right", na.rm =TRUE), by = ID]

Where: dt the data.table x.value.sum the variable you will create inside data.table x the variable that will be accumulated in a window of 2 n the size of the window sum is the function, in this case sum ID variable you want to group

Polyandra
  • 1
  • 1