I have a dataframe containing hundreds of thousands rows, but it can be exemplified as here below:
> mydata
ID TYPE HEIGHT WEIGHT
1 20 6 194 77.1
2 20 2 175 63.5
3 20 6 197 59.6
4 20 1 185 74.3
5 20 1 162 94.4
6 21 1 188 58.9
7 21 6 182 81.2
8 21 6 169 82.8
9 21 2 151 78.5
here's the code to reproduce it:
mydata <- data.frame(ID=c(20,20,20,20,20,21,21,21,21),
TYPE=(c(6,2,6,1,1,1,6,6,2)),
HEIGHT=c(194,175,197,185,162,188,182,169,151),
WEIGHT=c(77.1,63.5,59.6,74.3,94.4,58.9,81.2,82.8,78.5))
What I need to do is: for each ID
, calculate the sum of the WEIGHTS
by TYPE
, but only for those elements (in the same ID) that have a HEIGHT
higher than the one contained in the current row.
The new dataframe should then contain three more columns (one for each TYPE
), and should eventually look like this:
> mydata_new
ID TYPE HEIGHT WEIGHT SUM.W.TYPE6 SUM.W.TYPE2 SUM.W.TYPE1
1 20 6 194 77.1 59.6 0.0 0.0
2 20 2 175 63.5 136.7 0.0 74.3
3 20 6 197 59.6 0.0 0.0 0.0
4 20 1 185 74.3 136.7 0.0 0.0
5 20 1 162 94.4 136.7 63.5 74.3
6 21 1 188 58.9 0.0 0.0 0.0
7 21 6 182 81.2 0.0 0.0 58.9
8 21 6 169 82.8 81.2 0.0 59.9
9 21 2 151 78.5 164.0 0.0 58.9
If possible, I would like to avoid to go through each row with a loop, as given my extensive dataset, it would take too long. Any smart solution? Perhaps using some suitable packages such as dplyr
, data.table
, or simply using apply
or sapply
?
I want to understand how to create a cumulative sum which is based on the value in different columns of the same row, but is also dependent on separate groupings (i.e. TYPE
).