I am having trouble figuring out how to sum rows in a data.table
while omitting the values of a certain group in the process.
Let's say I have a data.table
of the following form:
library(data.table)
dt <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
name = c("Tom", "Tom", "Tom", "Tom", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Ann", "Ann", "Ann", "Ann"),
g1 = c(1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1),
g2 = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
g3 = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1),
g4 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1))
setkey(dt, name, year)
where g1
-g4
are indicator variables for games in which the players in name
participated at time year
.
What I want to do is to calculate the number of players for each game NPg1
-NPg4
in which both players participated in the focal game, but only if they also played against each other in another game in the same year and this sum should exclude the player for whom it is being calculated.
I get close using this code modified from how to cumulatively add values in one vector in R e.g for NPg1
:
dtg1 <- dt[,.SD[(g1==1) & (g2==1 | g3==1 | g4==1)][, NPg1:= sum(g1)], by=year]
This subsets the dt
on my conditions and creates the sum, however, the sum includes the focal players. For example NPg1
in year==2000
is 1 for Tom, but it should be 0 because even though he played in g1
he did not play another player in another game in that year. Once I get the sum right, I can then do this for each game and combine the results back into a data.table
. The main question is, how can I get the correct sum with these conditions.
The result for NPg1
should look like this
dtg1$NPg1result <- c(0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3)
Any ideas would be greatly appreciated.
After @Mike.Gahan's comment:
This is the sub-result for g1
, maybe this does not become very clear form my post. Once I have that correctly I could easily join it back to the full data.table
using:
library(plyr)
dt <- join(dt, dtg1)
or other merge/join operations but since my question is mainly concerned with the sub-result I did not want to bother everyone with the rest.
EDIT after @ Ricardo Saportas solution
The full desired result with all the games looks as follows:
dtresult <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
name = c("Ann", "Ann", "Ann", "Ann", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Tom", "Tom", "Tom", "Tom"),
NPg1 = c(0, 1, 3, 3, 0, 0, 3, 3, 0, 0, 3, 3, 0, 1, 3, 3),
NPg2 = c(0, 0, 2, 3, 0, 0, 2, 3, 1, 0, 0, 3, 1, 0, 2, 3),
NPg3 = c(0, 0, 3, 2, 0, 2, 3, 0, 1, 2, 3, 2, 1, 2, 3, 2),
NPg4 = c(0, 0, 2, 2, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 2, 2))