
I have a relatively large data frame (~2,000,000 rows) where, for each row, I need to calculate, within that observation's group, the sum of every value greater than or equal to the current row's value (excluding the row's own value).

Here is a sample data frame:

sample_df = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10)
)

I currently have a very slow solution that uses a loop and some filtering, and a faster solution would be much preferred. I've been trying to use dplyr, but I cannot figure out how to get the sum of the other observations once the data are grouped.

With the above toy example, here would be the desired output:

desired_output = data.frame(
  group_id = c(1,1,1,1,2,2,2,2),
  value = c(10,12,14,12,8,8,21,10),
  output = c(38,26,0,26,39,39,0,21)
)
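
For illustration, my slow loop-and-filter approach looks roughly like this (a simplified sketch, not my exact code):

output <- numeric(nrow(sample_df))
for (i in seq_len(nrow(sample_df))) {
  # filter to the current row's group, then sum the qualifying values
  grp <- sample_df[sample_df$group_id == sample_df$group_id[i], ]
  output[i] <- sum(grp$value[grp$value >= sample_df$value[i]]) - sample_df$value[i]
}
sample_df$output <- output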

Looking around at solutions that have already been posted, I haven't seen a clear answer explaining how to compare each observation in a group against the other observations in that group, filtered on some criterion. I'd prefer a dplyr-based solution, but if there are efficient base R or data.table solutions then I'd be equally grateful!

John K.

4 Answers


Using the tidyverse. The trick is to use map_dbl to loop over every value within each group.

library(tidyverse)
sample_df %>%
  group_by(group_id) %>%
  mutate(output = map_dbl(value, ~ sum(value[value >= .x])) - value) %>%
  ungroup()

# A tibble: 8 x 3
  group_id value output
     <dbl> <dbl>  <dbl>
1        1    10     38
2        1    12     26
3        1    14      0
4        1    12     26
5        2     8     39
6        2     8     39
7        2    21      0
8        2    10     21

In the mutate call, value refers to the whole value column of the current group, while .x is the single element you're looping on.
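
Note that map_dbl compares each element against the entire group vector, so the work grows quadratically with group size; that's fine for moderate groups, but for millions of rows see the data.table answers below.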

A base solution (note that it relies on the rows already being sorted by group_id, since unlist(tapply(...)) returns the results in group order):

within(sample_df, output <- unlist(tapply(
  value, group_id, function(x) sapply(x, function(y) sum(x[x >= y]) - y))))
#   group_id value output
# 1        1    10     38
# 2        1    12     26
# 3        1    14      0
# 4        1    12     26
# 5        2     8     39
# 6        2     8     39
# 7        2    21      0
# 8        2    10     21
moodymudskipper

Less compact and a little tricky, but much faster, and using only data.table.

The trick is that once your data is sorted in descending order of value within each group_id, all you need to compute is a cumulative sum by group_id, which is very fast.

Whenever a value appears multiple times within a group, you want to keep the last cumulative sum, which has taken all previous occurrences into account.
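
For example, group 1 sorted in descending order is 14, 12, 12, 10; the cumulative sums are 14, 26, 38, 48, and subtracting each row's own value gives 0, 14, 26, 38. For the duplicated 12, only the second entry (26) counts both occurrences, which is why the last row per (group_id, value) pair is kept.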

library(data.table)
DT <- as.data.table(sample_df)[order(group_id, -value)]
DT[, output := cumsum(value) - value, keyby = .(group_id)]
temp <- DT[, .SD[.N], by = .(group_id, value)]  # keep the last row per group and value
DT <- merge(as.data.table(sample_df), temp, by = c("group_id", "value"), sort = FALSE)

#    group_id value output
# 1:        1    10     38
# 2:        1    12     26
# 3:        1    12     26
# 4:        1    14      0
# 5:        2     8     39
# 6:        2     8     39
# 7:        2    10     21
# 8:        2    21      0

This solution is about 4,000 times faster than the alternative solutions proposed, at the 10^6-observation benchmark. It handles up to 10^8 observations in under a minute.

#       N data.table.trick             dplyr          sapply              base
#1: 1e+06 0.067678928 secs 261.32966185 secs 282.639625 secs 275.08949995 secs
#2: 1e+05 0.013036013 secs   3.55517507 secs   5.356922 secs   3.36490607 secs
#3: 1e+04 0.007019043 secs   0.09926391 secs   0.312326 secs   0.04562092 secs

I measured the times with Sys.time(), using the following benchmark setup:

N = 10^8  # observations
G = 20    # groups
V = 100   # distinct values
sample_df = data.table(
  group_id = sample(1:G, N, replace = TRUE),
  value = sample(1:V, N, replace = TRUE)
)
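
A minimal timing wrapper along these lines reproduces the measurement (a sketch, not the original benchmark script):

t0 <- Sys.time()
DT <- as.data.table(sample_df)[order(group_id, -value)]
DT[, output := cumsum(value) - value, keyby = .(group_id)]
temp <- DT[, .SD[.N], by = .(group_id, value)]
res <- merge(sample_df, temp, by = c("group_id", "value"), sort = FALSE)
Sys.time() - t0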
Frostic

Using base R *apply functions. Not as readable as @moodymudskipper's answer, but it gives the same output without any extra packages.

sample_df$output <- unlist(lapply(split(sample_df, sample_df$group_id), function(x) {
  sapply(seq_len(nrow(x)), function(i) {
    sum(x$value[x$value >= x$value[i]]) - x$value[i]
  })
}))

sample_df

  group_id value output
1        1    10     38
2        1    12     26
3        1    14      0
4        1    12     26
5        2     8     39
6        2     8     39
7        2    21      0
8        2    10     21
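
Note that this relies on the rows already being ordered by group_id: split() returns the groups in sorted order, and unlist() concatenates the per-group results in that order before they are assigned back.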
Jilber Urbina

This is a simple non-equi join problem:

library(data.table)
dt = as.data.table(sample_df)

dt[dt, on = .(group_id, value >= value), by = .EACHI,
   .(output = sum(x.value) - i.value)]
#   group_id value output
#1:        1    10     38
#2:        1    12     26
#3:        1    14      0
#4:        1    12     26
#5:        2     8     39
#6:        2     8     39
#7:        2    21      0
#8:        2    10     21
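
For each row of i (the table joined to itself, hence by = .EACHI), the join finds every row of x in the same group_id with value >= that row's value; x.value refers to the matched values in x, i.value to the row's own value, so sum(x.value) - i.value removes the row's own contribution.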
eddi