1

I have a question which really need your help:

set.seed(1111)
s<-rep(seq(1,4),5)
a<-runif(20,0.2,0.6)
b<-runif(20,0.4,0.7)
b[6:8]<-NA
c<-runif(20,4,7)
d<-data.table(s,a,b,c)
setkey(d,s)

The data is as following:

    s         a         b        c
 1: 1 0.3862011 0.4493240 6.793058
 2: 1 0.4955267 0.4187441 4.708561
 3: 1 0.4185155 0.5916827 6.810053
 4: 1 0.5003833 0.5403744 5.948629
 5: 1 0.5667312 0.5634135 6.880848
 6: 2 0.3651699 0.5263655 5.721908
 7: 2 0.5905308        NA 6.863213
 8: 2 0.2560464 0.4649180 5.745656
 9: 2 0.4533625 0.5077432 5.958526
10: 2 0.4228027 0.4340407 5.115065
11: 3 0.5628013 0.6517352 6.252962
12: 3 0.5519840        NA 4.875669
13: 3 0.2006761 0.6418540 5.452210
14: 3 0.5472671 0.4503713 6.962282
15: 3 0.5601675 0.5195013 6.666593
16: 4 0.2548422 0.6962112 5.535579
17: 4 0.2467137        NA 6.680080
18: 4 0.4995830 0.6793684 6.334579
19: 4 0.2637452 0.4078512 6.076039
20: 4 0.5063548 0.4055017 5.287291

If I do a simple sum, using s as key, it will return a nice table summarize the result:

d[,sum(c),by=s]

s       V1
1: 1 31.14115
2: 2 29.40437
3: 3 30.20972
4: 4 29.91357

However, if my data.table command contain ifelse statement, I will not get similar table:

 d2<-d[,ifelse(a<b,"NA",sum(c)),by=s]
 d2


 s               V1
 1: 1               NA
 2: 1 31.1411493057385
 3: 1               NA
 4: 1               NA
 5: 1 31.1411493057385
 6: 2               NA
 7: 2               NA
 8: 2               NA
 9: 2               NA
10: 2               NA
11: 3               NA
12: 3               NA
13: 3               NA
14: 3 30.2097161230631
15: 3 30.2097161230631
16: 4               NA
17: 4               NA
18: 4               NA
19: 4               NA
20: 4 29.9135677714366

Is that possible to use the ifelse statement return a result just like the simple sum result table which return the unique non-na value under the each index value?

Thanks a lot!!!!!

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
sxgn
  • 137
  • 7

3 Answers3

3

I am not entirely certain what you are looking for, but I think you just want to use the a<b condition as the row selector in your data.table, which is done by using it as the first argument in the brackets:

> d[a<b, sum(c), by = s]
s V1  
1: 1 19.6
2: 2 22.5
3: 3 11.7
4: 4 17.9
Fojtasek
  • 3,492
  • 1
  • 26
  • 23
  • My understanding on the line `d2<-d[,ifelse(a – KFB Feb 28 '15 at 06:37
  • Thanks to both of you, even my self I get confused what I am trying to do. But you actually did what I want..I used to think the statement right after the `[` is only for the table joint use, but now I know it could be used as conditional statement. Really thanks for your help!! – sxgn Feb 28 '15 at 12:00
1

There is a simple and fast solution based on conditional sum using which:

d[, .( sum_c = sum(c[which( a < b)]) ), by=s]

#    s  sum_c
# 1: 1 19.552
# 2: 2 22.541
# 3: 3 11.705
# 4: 4 17.946

The advantage of this structure over the other answers presented so far is that it allows you to calculate different aggregations in the same call using different conditions, for example:

d[, .(  sum_c = sum(c[which( a < b)]),
        sum_a = sum(c[which( c < 6)]) ), by=s]

#    s  sum_c  sum_a
# 1: 1 19.552 10.657
# 2: 2 22.541 22.541
# 3: 3 11.705 10.328
# 4: 4 17.946 10.823

There is a benchmark of the speed of this solution compared to other approaches in a similar question, here.

Community
  • 1
  • 1
rafa.pereira
  • 13,251
  • 6
  • 71
  • 109
0
  library(plyr)  
  ddply(d[a<b], .(s), summarize, tot=sum(c))
Amrita Sawant
  • 10,403
  • 4
  • 22
  • 26