
I'm trying to add a column to a data set that shows the sum of the values in one column for each group ID in another column. The new total column should be empty except for one sum per group.

aggregate(Diff ~ Group, data.set, sum) gives me the correct sums but strips out all the other rows, while something like data.set$Total <- ave(data.set$Diff, factor(data.set$Group), FUN=sum) gives me the new column Total but with no values. As an example, the input data set looks like this:

Group  Diff
1 
1     -16055
1     -1313
1      45707
1      6569
2 
2     -7249
2      2
3 
3     -384724

I'd like the output to look like this:

Group   Diff  Total
1 
1     -16055
1     -1313
1      45707
1      6569     34908 
2 
2     -7249
2      2       -7247
3 
3     -384724  -384724

The Diff column is the result of a previous calculation that finds the difference between values in another column, so the first row of every group is empty. Similar examples show how to sum a column, and some show sums by group, but none seem to show how to achieve output consistent with what I need. Thanks for the help
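For what it's worth, I suspect the empty Total comes from the NAs: `sum` returns `NA` whenever its input contains one. A minimal sketch of the `na.rm` fix (which still repeats the total on every row of the group rather than leaving it only on the last row, so it doesn't fully solve my problem):

# sum each group ignoring the NA in its first row;
# ave() recycles the scalar result, so every row of the group gets the total
data.set$Total <- ave(data.set$Diff, factor(data.set$Group),
                      FUN = function(x) sum(x, na.rm = TRUE))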

Bea
  • Are the places where your `Diff` column is blank actually `NA`s? Or is it not a numeric variable? – Mike H. Jun 13 '17 at 14:58
  • They are blank. There is another column called Rank with a value in each row, and Diff is the result of subtracting value 1 from value 2, value 2 from value 3, etc. That calculation is: `data.set$Diff <- ave(data.set$Rank, factor(data.set$Group), FUN=function(x) c(NA, diff(x*-1)))` – Vito DiMercurio Jun 13 '17 at 15:04

4 Answers


Assuming your Diff column is numeric and those blanks really are `NA`, you could do:

library(data.table)
dt <- data.table(Group = c(1,1,1,1,1,2,2,2,3,3), Diff = c(NA,-16055,-1313, 45707,6569,NA,-7249,2,NA,-384724))

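# for each group: NA everywhere except the last row, which gets the group sum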
dt[,total := ifelse(seq_len(.N) == .N, sum(Diff, na.rm = T), NA), by = Group]

 #   Group    Diff   total
 #1:     1      NA      NA
 #2:     1  -16055      NA
 #3:     1   -1313      NA
 #4:     1   45707      NA
 #5:     1    6569   34908
 #6:     2      NA      NA
 #7:     2   -7249      NA
 #8:     2       2   -7247
 #9:     3      NA      NA
#10:     3 -384724 -384724

Alternatively, you can do a merge, as per Frank's suggestion in the comments:

dt[dt[, sum(Diff, na.rm=TRUE), by=Group], on=.(Group), mult="last", total := i.V1 ]
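Here the inner `dt[...]` computes each group's sum (returned in the default column name `V1`), and the outer join-update writes it back into `dt` as `total`; `mult="last"` restricts the assignment to the last matching row of each group, so the other rows stay `NA`.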

Timing Comparison:

To see how the timing compares between @Frank's merge option and my original solution, I varied the number of groups and the number of observations (a high and a low option for each) and ran a microbenchmark on the 4 resulting datasets. The results are below, and in every case Frank's merge option is the fastest. I think the bottleneck in my solution is the `ifelse`; removing it would likely make it faster, although by how much is uncertain (one ifelse-free variant is sketched after the timings).

set.seed(1)
high_grp <- 1:10000; high_obs = 1000000;
low_grp <- 1:100; low_obs = 50000;
low_grp_high_obs <- data.table(Group = sample(low_grp, high_obs, replace = T), Diff = sample(-60000:60000, high_obs, replace = T)) 
high_grp_high_obs <- data.table(Group = sample(high_grp, high_obs, replace = T), Diff = sample(-60000:60000, high_obs, replace = T)) 
low_grp_low_obs <- data.table(Group = sample(low_grp, low_obs, replace = T), Diff = sample(-60000:60000, low_obs, replace = T)) 
high_grp_low_obs <- data.table(Group = sample(high_grp, low_obs, replace = T), Diff = sample(-60000:60000, low_obs, replace = T)) 

comparison_sets <- list("Low Group; High Obs" = low_grp_high_obs, "High Group; High Obs" = high_grp_high_obs, 
                        "Low Group; Low Obs" = low_grp_low_obs, "High Group; Low Obs" = high_grp_low_obs)

comparison <- lapply(comparison_sets, function(dt) {microbenchmark::microbenchmark(orig = dt[,total := ifelse(seq_len(.N) == .N, sum(Diff, na.rm = T), NA), by = Group],
                                                                     merge = dt[dt[, sum(Diff, na.rm=TRUE), by=Group], on=.(Group), mult="last", total := i.V1 ])} )


comparison

#$`Low Group; High Obs`
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#  orig 53.16160 58.00227 69.93443 60.08673 62.57489 191.1628   100
# merge 12.93931 15.15634 17.90187 15.56495 18.33738 147.9433   100
#
#$`High Group; High Obs`
#Unit: milliseconds
#  expr       min       lq      mean    median       uq      max neval
#  orig 143.60222 151.8497 161.65825 154.85638 158.2183 281.2311   100
# merge  23.18698  23.7380  29.20126  24.86465  29.9832 153.7919   100
#
#$`Low Group; Low Obs`
#Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval
#  orig 3.047569 3.190157 3.957012 3.378145 3.692857 8.087345   100
# merge 1.685882 1.808594 1.928094 1.846520 1.953369 5.998864   100
#
#$`High Group; Low Obs`
#Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
#  orig 65.903991 68.727469 69.861163 69.857406 70.950330 76.351860   100
# merge  3.418077  3.595673  3.831805  3.855684  3.952869  5.069314   100
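
For reference, here is one way to drop the `ifelse` (a sketch in the spirit of Frank's `replace` suggestion in the comments; not part of the benchmark above):

# build an all-NA vector per group, then drop the group sum into its last slot
dt[, total := replace(rep(NA_real_, .N), .N, sum(Diff, na.rm = TRUE)), by = Group]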
Mike H.
  • Alternately, `dt[dt[, sum(Diff, na.rm=TRUE), by=Group], on=.(Group), mult="last", total := i.V1 ]` is somewhat comparable to https://stackoverflow.com/q/32101368/ – Frank Jun 13 '17 at 15:05
  • @Frank, thanks! Hadn't thought of using a merge. I'll update my answer to include this additional option. – Mike H. Jun 13 '17 at 15:10
  • Ok cool. Fyi, in benchmarks for cases like this, it's probably important to consider not just # rows but also # groups... which can admittedly get tedious fast. – Frank Jun 13 '17 at 15:15
  • Mike, can you tell me where I can learn about the `.N` notation, or just what it's called? It's hard to look up, and I can't find it by just looking for data.table aggregation tutorials – moodymudskipper Jun 13 '17 at 15:21
  • @Frank great point, I've included a slightly more realistic (although still not comprehensive) comparison in my answer. – Mike H. Jun 13 '17 at 15:31
  • @Moody_Mudskipper try: `?data.table::.N` – Mike H. Jun 13 '17 at 15:32
  • When I run this code, I get an error saying "unused argument by = Group". So I thought that maybe the right paren after "NA" needed to be moved to after "Group", but that throws a "not recognized" error on `:=`. When I run the code suggested by Frank, I also get an "unused argument" error on `by=Group`. – Vito DiMercurio Jun 13 '17 at 15:46
  • R is case sensitive; are you using `By` or `by`? You should use `by` in `data.table` (if you try `By` it won't be recognized, which is likely why you are getting an error). – Mike H. Jun 13 '17 at 15:49
  • Sorry, I typed that message in a hurry and it was nearly incoherent. I am using "by", not "By", as typed in the solution. – Vito DiMercurio Jun 13 '17 at 16:01
  • The error was my own. I hadn't called `library(data.table)` first. – Vito DiMercurio Jun 13 '17 at 16:04
  • Nice. Re "I think the bottleneck in my solution is the ifelse", you could try `replace(x, .N, v)` instead maybe. – Frank Jun 13 '17 at 16:31
---

Try this. We aggregate first, then merge the result into your existing dataset:

result <- merge(data.set,
                setNames(aggregate(Diff ~ Group, data.set, sum), c("Group", "Total")),
                all.x = TRUE)

and add `result$Total[-cumsum(table(data.set$Group))] <- ""` (or `<- NA`) if you don't want the total repeated on every row of the group
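
Put together on the example data (a sketch; it assumes the blanks in Diff really are `NA`):

data.set <- data.frame(Group = c(1,1,1,1,1,2,2,2,3,3),
                       Diff  = c(NA,-16055,-1313,45707,6569,NA,-7249,2,NA,-384724))

# aggregate() drops NA rows by default, so each group's sum ignores the blank first row
totals <- setNames(aggregate(Diff ~ Group, data.set, sum), c("Group", "Total"))
result <- merge(data.set, totals, all.x = TRUE)

# keep the total only on the last row of each group
result$Total[-cumsum(table(result$Group))] <- NA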

moodymudskipper
    This solution generally works, but in instances where the group only has one row (which is blank) it completely removes the entire row. Is there a modification so that when a group has only one row, where Diff might be empty, it simply inserts the NA and leaves the row? – Vito DiMercurio Jun 13 '17 at 15:53
  • I added `all.x = TRUE` to the merge function so all of data.set is preserved; it should work now – moodymudskipper Jun 13 '17 at 16:04
  • Curious artifact with this solution. It's not affecting the math, but when I use this same solution on another column (PriceDiff/PriceTotal), the placement of the Total in the original Diff column shifts to a seemingly random row within the group rather than staying in the last row. Again, this doesn't actually affect the result; I was just curious if you had an idea why that might be happening. The result in the new PriceTotal column is always in the last row of the group, as expected. – Vito DiMercurio Jun 16 '17 at 14:47
  • This is really odd; I don't see how a value could jump rows. If you can give me a reproducible example I'll look into it – moodymudskipper Jun 16 '17 at 14:57
---

Is it ok if 34908 is present in front of every "1", not only in front of the last one?

If it is, you can use the "sqldf" library:

library(sqldf)

# the data frame here is assumed to be named `data`, with columns `groupe` and `diff`
# (calling the column `groupe` sidesteps the SQL keyword GROUP)
data_count <- sqldf('select groupe, sum(diff) as Total from data group by groupe')
new_data <- sqldf('select a.*, b.Total from data as a inner join data_count as b on a.groupe = b.groupe')

Also, if you really want NA as in your example, you can add this:

# blank out Total on every row whose successor belongs to the same group,
# leaving the sum only on each group's last row
for (i in 1:(dim(new_data)[1] - 1)) {
  if (new_data[i, "groupe"] == new_data[i + 1, "groupe"]) {
    new_data[i, "Total"] <- NA
  }
}
MBnnn
---

Another possible approach using split/unsplit:

DF <- data.frame(Group=c(1,1,1,1,1,2,2,2,3,3), 
                 Diff=c(NA,-16055,-1313,45707,6569,NA,-7249,2,NA,-384724))

customSum <- function(x){
  v <- x
  v[] <- NA                             # same length as x, but all NA
  v[length(v)] <- sum(x, na.rm = TRUE)  # put the group sum in the last slot
  return(v)
}
DF$Total <- unsplit(lapply(split(DF$Diff,DF$Group),customSum),DF$Group)

> DF
   Group    Diff   Total
1      1      NA      NA
2      1  -16055      NA
3      1   -1313      NA
4      1   45707      NA
5      1    6569   34908
6      2      NA      NA
7      2   -7249      NA
8      2       2   -7247
9      3      NA      NA
10     3 -384724 -384724
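
Here `split` divides Diff into per-group chunks, `lapply` applies `customSum` to each chunk, and `unsplit` puts the results back in the original row order, so no re-sorting of the data is needed.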
digEmAll