I have a dataframe containing hundreds of thousands of rows, but it can be exemplified as below:

> mydata
  ID TYPE HEIGHT WEIGHT
1 20    6    194   77.1
2 20    2    175   63.5
3 20    6    197   59.6
4 20    1    185   74.3
5 20    1    162   94.4
6 21    1    188   58.9
7 21    6    182   81.2
8 21    6    169   82.8
9 21    2    151   78.5

Here's the code to reproduce it:

mydata <- data.frame(ID=c(20,20,20,20,20,21,21,21,21), 
                     TYPE=c(6,2,6,1,1,1,6,6,2), 
                     HEIGHT=c(194,175,197,185,162,188,182,169,151), 
                     WEIGHT=c(77.1,63.5,59.6,74.3,94.4,58.9,81.2,82.8,78.5))

What I need to do is: for each ID, calculate the sum of the WEIGHT values by TYPE, but only over those rows (within the same ID) whose HEIGHT is greater than that of the current row.

The new dataframe should then contain three more columns (one for each TYPE), and should eventually look like this:

> mydata_new
  ID TYPE HEIGHT WEIGHT SUM.W.TYPE6 SUM.W.TYPE2 SUM.W.TYPE1
1 20    6    194   77.1        59.6         0.0         0.0
2 20    2    175   63.5       136.7         0.0        74.3
3 20    6    197   59.6         0.0         0.0         0.0
4 20    1    185   74.3       136.7         0.0         0.0
5 20    1    162   94.4       136.7        63.5        74.3
6 21    1    188   58.9         0.0         0.0         0.0
7 21    6    182   81.2         0.0         0.0        58.9
8 21    6    169   82.8        81.2         0.0        58.9
9 21    2    151   78.5       164.0         0.0        58.9

If possible, I would like to avoid looping through each row, since with my large dataset it would take too long. Any smart solution, perhaps using a suitable package such as dplyr or data.table, or simply apply or sapply?

I want to understand how to create a cumulative sum that is based on values in other columns of the same row, but is also dependent on separate groupings (i.e. TYPE).
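
To pin down exactly what I mean, here is a direct row-wise translation of the definition in dplyr (purely illustrative; the sum_w_type helper is my own naming, and this brute-force version is what I am trying to avoid at scale):

library(dplyr)

# Illustrative helper: for each height h, sum WEIGHT over the group's rows
# of the given type whose HEIGHT is strictly greater than h
sum_w_type <- function(h, type, TYPE, HEIGHT, WEIGHT) {
  sapply(h, function(x) sum(WEIGHT[TYPE == type & HEIGHT > x]))
}

mydata_new <- mydata %>%
  group_by(ID) %>%
  mutate(SUM.W.TYPE6 = sum_w_type(HEIGHT, 6, TYPE, HEIGHT, WEIGHT),
         SUM.W.TYPE2 = sum_w_type(HEIGHT, 2, TYPE, HEIGHT, WEIGHT),
         SUM.W.TYPE1 = sum_w_type(HEIGHT, 1, TYPE, HEIGHT, WEIGHT)) %>%
  ungroup()

This does quadratic work within each ID, which is exactly why I am looking for something faster.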

  • Is it possible to have two heights that are equal for the same `ID`? – bouncyball Aug 19 '16 at 13:00
  • Yes, it is possible. In that case, if they are the only two elements in the same `ID` and have the same `TYPE`, their SUM.W.TYPE would simply be 0.0, because there are no other elements of the same `TYPE` with a greater height – refroll Aug 19 '16 at 13:05
  • Why is there an entry in row 5 for `SUM.W.TYPE2`? – bouncyball Aug 19 '16 at 13:07
  • Because within the same `ID` the element in row 2 (which is a TYPE 2) has a height which is higher than the element TYPE 1 in row 5. Thus `SUM.W.TYPE2` reports the `WEIGHT` of the TYPE 2 element. – refroll Aug 19 '16 at 13:13
  • I'm not entirely convinced this is an exact duplicate of this question. I'm going to post an approach in the next comment. Obviously, the formatting is going to be ugly... but I can't post an answer. – bouncyball Aug 19 '16 at 13:36
  • `sapply(c(1, 2, 6), function(y) apply(mydata, 1, function(x, TYPE = y) sum(ifelse(mydata[mydata$ID == x[1] & mydata$TYPE == TYPE,]$HEIGHT > x[3], mydata[mydata$ID == x[1] & mydata$TYPE == TYPE,]$WEIGHT, 0))) )` – bouncyball Aug 19 '16 at 13:36
  • Thanks @bouncyball !! I will have to modify it a bit for my complete dataset (which includes more variables), but your approach works and it is quite efficient! It's a shame that they marked the question as duplicate... – refroll Aug 19 '16 at 14:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121355/discussion-between-bouncyball-and-refroll). – bouncyball Aug 19 '16 at 14:06
  • @akrun I don't believe this to be a dupe (at least not of the linked question). The linked question is conditional on previous values whereas this one is conditional on another column. – Dean MacGregor Aug 19 '16 at 16:41
  • @DeanMacGregor Okay, I opened it. – akrun Aug 19 '16 at 16:43

5 Answers

As suggested in the OP, a cumulative sum works here:

library(data.table)
setDT(mydata)

ut = sort(unique(mydata$TYPE))

# Visit each ID's rows from tallest to shortest; the shifted running sum of
# WEIGHT*(TYPE==x) then gives, per row, the total weight of taller rows of
# type x (assuming distinct heights; ties are handled below)
mydata[order(-HEIGHT), paste0("sum_",ut) := lapply(ut, 
  function(x) shift(cumsum( WEIGHT*(TYPE==x) ), fill=0) 
), by=ID]

   ID TYPE HEIGHT WEIGHT sum_1 sum_2 sum_6
1: 20    6    194   77.1   0.0   0.0  59.6
2: 20    2    175   63.5  74.3   0.0 136.7
3: 20    6    197   59.6   0.0   0.0   0.0
4: 20    1    185   74.3   0.0   0.0 136.7
5: 20    1    162   94.4  74.3  63.5 136.7
6: 21    1    188   58.9   0.0   0.0   0.0
7: 21    6    182   81.2  58.9   0.0   0.0
8: 21    6    169   82.8  58.9   0.0  81.2
9: 21    2    151   78.5  58.9   0.0 164.0
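
To see the mechanism, here is a small trace for ID 20 and TYPE 6 (my own sketch, not part of the original answer), using a plain cumsum and a manual shift:

# Sketch: trace the shifted running sum for ID 20, TYPE 6
d <- mydata[ID == 20, .(TYPE, HEIGHT, WEIGHT)][order(-HEIGHT)]  # tallest first
contrib <- d$WEIGHT * (d$TYPE == 6)  # a row contributes only if it is TYPE 6
running <- cumsum(contrib)           # running total includes the current row...
d$sum_6 <- c(0, head(running, -1))   # ...so shift down one: taller rows only
d
#    TYPE HEIGHT WEIGHT sum_6
# 1:    6    197   59.6   0.0
# 2:    6    194   77.1  59.6
# 3:    1    185   74.3 136.7
# 4:    2    175   63.5 136.7
# 5:    1    162   94.4 136.7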

Repeating height measurements. So far, this only works if all heights are distinct within each ID (as in the OP's current example). The OP mentioned in a comment, however, that heights may repeat. Thanks to @DeanMacGregor, here's an extension for that case: since the running sums never decrease along the sorted order, giving every row of a tied (ID, TYPE, HEIGHT) group the first tied row's values (equivalently, @Dean's lapply(.SD, min)) excludes the equal-height rows:

# run the code above, and then...
mydata[order(-HEIGHT), paste0('sum_',ut) := 
  .SD[1L]
, by=.(ID,TYPE,HEIGHT), .SDcols=paste0('sum_',ut)]

Or do it in a single `by` step:

ut = sort(unique(mydata$TYPE))
mydata[order(-HEIGHT), paste0("sum_",ut) := {
  sd = lapply(ut, function(x) shift(cumsum( WEIGHT*(TYPE==x) ), fill=0))
  setDT(sd)[, .SD[1L], by=.(HEIGHT,TYPE)][, c("HEIGHT","TYPE") := NULL]
}, by=ID]
Frank
  • Well, I thought @Arun crushed it, but this only takes 0.148s and it doesn't require switching to the devel version. – Dean MacGregor Aug 19 '16 at 23:13
  • There's one flaw in this. OP said sometimes there can be same ID, HEIGHT combos and with using the `order` methodology it will include instances where the HEIGHT is the same when it should be excluded. – Dean MacGregor Aug 19 '16 at 23:26
  • @Dean Oh, that's what I get for not paying attn to comments. OP should edit those things into the post. Maybe if I add a `*!duplicated(HEIGHT)` to the condition in the cumsum it would do it? I'm not really clear on the example + desired output for that extension... – Frank Aug 19 '16 at 23:29
  • Oh, I get it, so it should only sum heights strictly greater than the current height... hm, not sure this approach can be extended to that... maybe with `!duplicated(HEIGHT, fromLast=TRUE)`...? No, not that either. Might delete if the OP updates and I can't think of an extension. – Frank Aug 19 '16 at 23:30
  • Actually it's an easy fix. Just add this line after what you already have.... `mydata[,(paste0('sum_',ut)):=lapply(.SD,min),by=c('ID','TYPE','HEIGHT'),.SDcols=paste0('sum_',ut)]` – Dean MacGregor Aug 19 '16 at 23:35
  • @Dean Thanks; I definitely was barking up the wrong tree. I edited your approach a little bit, but I think it does the same thing...? Also added another way, but I'm not sure if it's right. – Frank Aug 19 '16 at 23:55

Here is an alternative data.table solution.

mydata[, c(.SD, setNames(lapply(sort(unique(TYPE)), 
                         function(type) apply(outer(HEIGHT, HEIGHT, "<"), 1, 
                                        function(higher) sum(WEIGHT[TYPE == type & higher]))), 
                paste0("SUM.W.TYPE", sort(unique(TYPE))))), by = ID]

#    ID TYPE HEIGHT WEIGHT SUM.W.TYPE1 SUM.W.TYPE2 SUM.W.TYPE6
# 1: 20    6    194   77.1         0.0         0.0        59.6
# 2: 20    2    175   63.5        74.3         0.0       136.7
# 3: 20    6    197   59.6         0.0         0.0         0.0
# 4: 20    1    185   74.3         0.0         0.0       136.7
# 5: 20    1    162   94.4        74.3        63.5       136.7
# 6: 21    1    188   58.9         0.0         0.0         0.0
# 7: 21    6    182   81.2        58.9         0.0         0.0
# 8: 21    6    169   82.8        58.9         0.0        81.2
# 9: 21    2    151   78.5        58.9         0.0       164.0

This uses the outer function to build a pairwise comparison matrix that flags, for each row, which rows within the ID have a larger height; the weights of the flagged rows that also match the given type are then subset and summed.
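
As a small illustration of the intermediate step (mine, not part of the original answer), here is the comparison matrix for ID 20; row i flags the rows taller than row i, and apply then sums the flagged weights that also match the type:

h <- mydata$HEIGHT[mydata$ID == 20]  # 194 175 197 185 162
outer(h, h, "<")
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,] FALSE FALSE  TRUE FALSE FALSE
# [2,]  TRUE FALSE  TRUE  TRUE FALSE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,]  TRUE FALSE  TRUE FALSE FALSE
# [5,]  TRUE  TRUE  TRUE  TRUE FALSE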

Psidom
  • You could use `{}` in the `j` and assign `sort(unique(TYPE))` to a variable and then use the variable name instead. That way it's only sorting once instead of twice. – Dean MacGregor Aug 19 '16 at 18:52
  • I guess that doesn't help that much since you're taking the uniques first, now that I think about it – Dean MacGregor Aug 19 '16 at 18:54

Here's another data.table solution using the recently implemented non-equi joins feature. You'll need to grab the development version of data.table, v1.9.7.

require(data.table) # v1.9.7
setDT(mydata) # convert data.frame to data.table without copying

foo <- function(x, val) {
    y = x[TYPE == val]  # only the rows of the requested TYPE
    # non-equi join: for each row of x, find the y-rows with the same ID and
    # a strictly greater HEIGHT, then sum their weights per x-row (.EACHI)
    y[x, on = .(ID, HEIGHT > HEIGHT), 
        .(sum_val = sum(WEIGHT, na.rm = TRUE)), 
        by = .EACHI
     ][, sum_val]
}
for (type in unique(mydata$TYPE)) {
    cat("type = ", sprintf("%2.0f", type), "\n", sep="")
    mydata[, paste("sum", type, sep="_") := foo(mydata, type)][]
}
mydata
#    ID TYPE HEIGHT WEIGHT sum_6 sum_2 sum_1
# 1: 20    6    194   77.1  59.6   0.0   0.0
# 2: 20    2    175   63.5 136.7   0.0  74.3
# 3: 20    6    197   59.6   0.0   0.0   0.0
# 4: 20    1    185   74.3 136.7   0.0   0.0
# 5: 20    1    162   94.4 136.7  63.5  74.3
# 6: 21    1    188   58.9   0.0   0.0   0.0
# 7: 21    6    182   81.2   0.0   0.0  58.9
# 8: 21    6    169   82.8  81.2   0.0  58.9
# 9: 21    2    151   78.5 164.0   0.0  58.9
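
For readers new to non-equi joins, here is a minimal sketch of what the on = .(ID, HEIGHT > HEIGHT) clause matches (my own illustration, not from the answer): for each row of x, it selects the y-rows sharing that ID whose HEIGHT is strictly greater, and by = .EACHI aggregates per matched x-row.

# data.table is already loaded above
x <- data.table(ID = 20, HEIGHT = 175)
y <- data.table(ID = 20, HEIGHT = c(194, 197, 162), WEIGHT = c(77.1, 59.6, 94.4))
y[x, on = .(ID, HEIGHT > HEIGHT), .(sum_val = sum(WEIGHT)), by = .EACHI]
#    ID HEIGHT sum_val
# 1: 20    175   136.7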

On the 300K-row dataset from @Dean, it takes ~19s in total, i.e. ~1s per TYPE.

Arun

Posting my answer, which was originally a comment:

# initializations
mydata$Sum.W.Type1 <- 0
mydata$Sum.W.Type2 <- 0
mydata$Sum.W.Type6 <- 0

# assignment: for each TYPE y and each row, sum the weights of the rows
# sharing the row's ID, having TYPE y, and standing taller than the row
mydata[, 5:7] <- 
  sapply(c(1, 2, 6), function(y) 
    apply(mydata, 1, 
          function(x, TYPE = y) 
            sum(ifelse(mydata[mydata$ID == x[1] & mydata$TYPE == TYPE, ]$HEIGHT > x[3],  
                       mydata[mydata$ID == x[1] & mydata$TYPE == TYPE, ]$WEIGHT, 0))))

mydata
  ID TYPE HEIGHT WEIGHT Sum.W.Type1 Sum.W.Type2 Sum.W.Type6
1 20    6    194   77.1         0.0         0.0        59.6
2 20    2    175   63.5        74.3         0.0       136.7
3 20    6    197   59.6         0.0         0.0         0.0
4 20    1    185   74.3         0.0         0.0       136.7
5 20    1    162   94.4        74.3        63.5       136.7
6 21    1    188   58.9         0.0         0.0         0.0
7 21    6    182   81.2        58.9         0.0         0.0
8 21    6    169   82.8        58.9         0.0        81.2
9 21    2    151   78.5        58.9         0.0       164.0

Depending on the scope of your data, the initializations will change, as will the number of columns requiring assignment. But this should be good enough to get you there.
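
For instance, here is a sketch (mine, untested at scale; it assumes mydata is still a plain data.frame) that derives the types from the data instead of hard-coding c(1, 2, 6):

# derive the set of types, then initialize one zero column per type
types <- sort(unique(mydata$TYPE))
mydata[paste0("Sum.W.Type", types)] <- 0

# same row-wise logic as above, restricted to the original four columns
mydata[, paste0("Sum.W.Type", types)] <- 
  sapply(types, function(y) 
    apply(mydata[, 1:4], 1, 
          function(x) 
            sum(mydata$WEIGHT[mydata$ID == x[1] & mydata$TYPE == y & 
                              mydata$HEIGHT > x[3]])))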

bouncyball

Here's an approach similar to @Psidom's.

library(data.table)
setDT(mydata)
mydata_new <- mydata[, c(.SD, {
  types <- unique(TYPE)
  setNames(lapply(types, function(curtype) {
    sapply(HEIGHT, function(curheight) {
      sum(WEIGHT[HEIGHT > curheight & TYPE == curtype])
    })
  }), paste0('SUM.W.TYPE', types))
}), by = 'ID']

The difference is that mine doesn't use outer, which, I suspect, is a memory/performance hog.
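
As a rough, back-of-the-envelope illustration of why (my numbers, not measured in the thread): outer materialises an n-by-n logical matrix for every ID group.

# A logical value occupies 4 bytes in R, so outer(HEIGHT, HEIGHT, "<") on an
# ID group of n rows allocates roughly n^2 * 4 bytes
n <- 30000        # rows per ID in the 300,000-row benchmark below
n^2 * 4 / 2^30    # ~3.35 GiB for a single group's comparison matrix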

Here's a benchmark:

library(data.table)
#create fake data with 300,000 rows
mydata <- data.frame(ID=rep(1:10,30000), 
                     TYPE=rep(1:20,each=15000), 
                     HEIGHT=as.integer(runif(300000,150,200)), 
                     WEIGHT=round(runif(300000,50,100),1))

setDT(mydata)

system.time({
  mydata_new <- mydata[, c(.SD, {
    types <- unique(TYPE)
    setNames(lapply(types, function(curtype) {
      sapply(HEIGHT, function(curheight) {
        sum(WEIGHT[HEIGHT > curheight & TYPE == curtype])
      })
    }), paste0('SUM.W.TYPE', types))
  }), by = 'ID']
})
#     user  system elapsed 
# 1125.244   1.460 1127.665 


system.time({
  psidata<-mydata[, c(.SD, setNames(lapply(sort(unique(TYPE)), 
                                  function(type) apply(outer(HEIGHT, HEIGHT, "<"), 1, 
                                                       function(higher) sum(WEIGHT[TYPE == type & higher]))), 
                           paste0("SUM.W.TYPE", sort(unique(TYPE))))), ID]
})
#     user  system elapsed 
# 3854.596 731.272 4774.742 

all.equal(mydata_new, psidata)
#TRUE

system.time({
  frankdata<-copy(mydata)
  ut = sort(unique(mydata$TYPE))
  frankdata[order(-HEIGHT), paste0("sum_",ut) :=  lapply(ut, 
                                                         function(x) shift(cumsum( WEIGHT*(TYPE==x) ), fill=0) 
  ), by=ID]
})

#    user  system elapsed 
#   0.148   0.000   0.148 

Frank's is by far the winner in performance.

Dean MacGregor