
I have a relatively large data frame (~2,000,000 rows) where, for each row, I need to calculate, within that observation's group, the sum of every value greater than or equal to the current row's value (excluding the row's own value).

Here is a sample data frame:

sample_df = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10)
)

I currently have a very slow solution that uses a loop and some filtering, and a faster solution would be much preferred. I've been trying to use dplyr, but I cannot figure out how to get the sum of the other observations once the data are grouped.

With the above toy example, here would be the desired output:

desired_output = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10),
  output = c(38,26,0,26,39,39,0,21)
)
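
For illustration, my slow loop-and-filter approach looks roughly like this (a simplified sketch, not my exact code):

output <- numeric(nrow(sample_df))
for (i in seq_len(nrow(sample_df))) {
  # filter to the current row's group, then sum the qualifying values
  grp <- sample_df[sample_df$group_id == sample_df$group_id[i], ]
  output[i] <- sum(grp$value[grp$value >= sample_df$value[i]]) - sample_df$value[i]
}
sample_df$output <- output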

Looking around at solutions that have already been posted, I haven't seen a clear answer explaining how to compare each observation in a group against the other observations in that group, filtered on some criterion. I'd prefer a dplyr-based solution, but if there are efficient base R or data.table solutions then I'd be equally grateful!

John K.

4 Answers


Using the tidyverse. The trick is to use map_dbl to loop over every value within each group.

library(tidyverse)
sample_df %>%
  group_by(group_id) %>%
  mutate(output = map_dbl(value, ~ sum(value[value >= .x])) - value) %>%
  ungroup()

# A tibble: 8 x 3
  group_id value output
     <dbl> <dbl>  <dbl>
1        1    10     38
2        1    12     26
3        1    14      0
4        1    12     26
5        2     8     39
6        2     8     39
7        2    21      0
8        2    10     21

In the mutate call, value refers to the whole value column of the current group, while .x is the single element you're looping on.
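
Note that map_dbl compares each element against the entire group vector, so the work grows quadratically with group size; that's fine for moderate groups, but for millions of rows see the data.table answers below.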

A base solution (note that it relies on the rows already being sorted by group_id, since unlist(tapply(...)) returns the results in group order):

within(sample_df, output <- unlist(tapply(
  value, group_id, function(x) sapply(x, function(y) sum(x[x >= y]) - y))))
#   group_id value output
# 1        1    10     38
# 2        1    12     26
# 3        1    14      0
# 4        1    12     26
# 5        2     8     39
# 6        2     8     39
# 7        2    21      0
# 8        2    10     21
moodymudskipper

Less compact and a little tricky, but much faster, and using only data.table.

The trick is that once your data is sorted in descending order of value within each group_id, all you need to compute is a cumulative sum by group_id, which is very fast.

Whenever a value appears multiple times within a group, you want to keep the last cumulative sum, which has taken all previous occurrences into account.
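
For example, group 1 sorted in descending order is 14, 12, 12, 10; the cumulative sums are 14, 26, 38, 48, and subtracting each row's own value gives 0, 14, 26, 38. For the duplicated 12, only the second entry (26) counts both occurrences, which is why the last row per (group_id, value) pair is kept.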

library(data.table)
DT <- as.data.table(sample_df)[order(group_id, -value)]
DT[, output := cumsum(value) - value, keyby = .(group_id)]
temp <- DT[, .SD[.N], by = .(group_id, value)]  # keep the last row per group and value
DT <- merge(as.data.table(sample_df), temp, by = c("group_id", "value"), sort = FALSE)

#    group_id value output
# 1:        1    10     38
# 2:        1    12     26
# 3:        1    12     26
# 4:        1    14      0
# 5:        2     8     39
# 6:        2     8     39
# 7:        2    10     21
# 8:        2    21      0

This solution is about 4,000 times faster than the alternative solutions proposed, at the 10^6-observation benchmark. It handles up to 10^8 observations in under a minute.

#       N data.table.trick             dplyr          sapply              base
#1: 1e+06 0.067678928 secs 261.32966185 secs 282.639625 secs 275.08949995 secs
#2: 1e+05 0.013036013 secs   3.55517507 secs   5.356922 secs   3.36490607 secs
#3: 1e+04 0.007019043 secs   0.09926391 secs   0.312326 secs   0.04562092 secs

I measured the times with Sys.time(), using the following benchmark setup:

N = 10^8  # observations
G = 20    # groups
V = 100   # distinct values
sample_df = data.table(
  group_id = sample(1:G, N, replace = TRUE),
  value = sample(1:V, N, replace = TRUE)
)
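
A minimal timing wrapper along these lines reproduces the measurement (a sketch, not the original benchmark script):

t0 <- Sys.time()
DT <- as.data.table(sample_df)[order(group_id, -value)]
DT[, output := cumsum(value) - value, keyby = .(group_id)]
temp <- DT[, .SD[.N], by = .(group_id, value)]
res <- merge(sample_df, temp, by = c("group_id", "value"), sort = FALSE)
Sys.time() - t0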
Frostic

Using base R *apply functions. Not as readable as @moodymudskipper's answer, but it gives the same output without any extra packages.

sample_df$output <- unlist(lapply(split(sample_df, sample_df$group_id), function(x) {
  sapply(seq_len(nrow(x)), function(i) {
    sum(x$value[x$value >= x$value[i]]) - x$value[i]
  })
}))

sample_df

  group_id value output
1        1    10     38
2        1    12     26
3        1    14      0
4        1    12     26
5        2     8     39
6        2     8     39
7        2    21      0
8        2    10     21
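
Note that this relies on the rows already being ordered by group_id: split() returns the groups in sorted order, and unlist() concatenates the per-group results in that order before they are assigned back.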
Jilber Urbina

This is a simple non-equi join problem:

library(data.table)
dt = as.data.table(sample_df)

dt[dt, on = .(group_id, value >= value), by = .EACHI,
   .(output = sum(x.value) - i.value)]
#   group_id value output
#1:        1    10     38
#2:        1    12     26
#3:        1    14      0
#4:        1    12     26
#5:        2     8     39
#6:        2     8     39
#7:        2    21      0
#8:        2    10     21
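
For each row of i (the table joined to itself, hence by = .EACHI), the join finds every row of x in the same group_id with value >= that row's value; x.value refers to the matched values in x, i.value to the row's own value, so sum(x.value) - i.value removes the row's own contribution.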
eddi