1

My question is similar to other questions that have been asked, such as this: Sum of rows based on column-value

Using data.table in R, if I want to sum values in a column based on unique values in another column, I can do the following:

mre = data.table(
    ref=rep(LETTERS[1:5],3),
    qual=c('B','C','D','E','A','C','D','E','A','B','D','E','A','B','C'),
    score=10:24
    )

mre[,team_score:=sum(score),by=ref]

Or:

mre[,team_val:=sum(score),by=qual]

But how would I do both, simultaneously? That is, Add all elements in mre[,score] for which mre[,ref] or mre[,qual] equal one of the five unique values (A through E). So, for A, the element in the new column would be the sum of all rows in score for which A occurs in either mre[,ref] or mre[,qual] (1,5,6,9,11,13).

I can do:

mre[ref=="A"|qual=="A",team_val:=sum(score)]

but that's terribly inefficient. Surely there is a better/smarter data.table way to do this?

EDIT: Clarifying desired output, a better example

Rather than creating a new column and calculating sums by reference, which would lead to complications involving multiple values in the ref and qual columns, I can do this:

test=
  merge(setkey(setnames(mre[,sum(score),by=ref],"ref","test_val"),test_val),
        setkey(setnames(mre[,sum(score),by=qual],"qual","test_val"),test_val),
        all=T)[,team_val:=rowSums(.SD,na.rm=T),.SDcols=c("V1.x","V1.y")]

which produces:

> test
   test_val V1.x V1.y team_val
1:        A   45   54       99
2:        B   48   52      100
3:        C   51   50      101
4:        D   54   48      102
5:        E   57   51      108

I'm hoping to do this in a way that doesn't require a merge. I only need the team_val column, which I immediately use in a multiplication expression with a column from another data.table (so I could just do:

team_val =
  merge(setkey(setnames(mre[,sum(score),by=ref],"ref","test_val"),test_val),
        setkey(setnames(mre[,sum(score),by=qual],"qual","test_val"),test_val),
        all=T)[,rowSums(.SD,na.rm=T),.SDcols=c("V1.x","V1.y")]

to get:

> team_val
[1]  99 100 101 102 108

but that still requires the merge). I'm struggling to incorporate chaining (i.e., Section 2b).

Community
  • 1
  • 1
Nigel Stackhouse
  • 481
  • 4
  • 20
  • It is not entirely clear what your expected output is. A bit faster than your current code would be the use of `pmin`: `mre[pmin(ref, qual) == "A", vals := sum(score)]`. If you want sums of values equal to A and those without an A, you could put this into by, like `mre[, vals := sum(score), by=.(pmin(ref, qual) == "A")]`. – lmo Apr 24 '17 at 20:30
  • 3
    Maybe I'm misunderstanding but I think the desired output needs rethinking. Say you do the calculation for rows with A (result 99) and apply that to any rows containing A. Then you do the same for B (result 100), now what does your new column show on the A-B and B-A rows, 99 or 100? – Gladwell Apr 24 '17 at 20:40
  • You're right, @Gladwell, re-doing my initial attempt for B overwrites the values for A. I've edited the question with a better description of the desired output. – Nigel Stackhouse Apr 25 '17 at 13:59

1 Answers1

1

I think this is what you want...

sapply(unique(c(mre$ref, mre$qual)), function(x) sum(mre$score[mre$ref == x | mre$qual == x]))

 A   B   C   D   E 
99 100 101 102 108 

With a data.frame/table of 1.5m rows, it takes just over 2 seconds.

It's a bit faster if you can specify the levels explicitly.

sapply(LETTERS[1:5], function(x) sum(mre$score[mre$ref == x | mre$qual == x]))

That's just under 1.5 seconds with the 1.5m rows.

Gladwell
  • 328
  • 3
  • 10