1

I am having trouble figuring out how to sum rows in a data.table while omitting the values of a certain group in the process.

Let's say I have a data.table of the following form:

library(data.table)
dt <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003), 
               name = c("Tom", "Tom", "Tom", "Tom", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Ann", "Ann", "Ann", "Ann"),
               g1 = c(1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1),
               g2 = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
               g3 = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1),
               g4 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1))

setkey(dt, name, year)

where g1-g4 are indicator variables for games in which the players in name participated at time year.

What I want to do is to calculate the number of players for each game NPg1-NPg4 in which both players participated in the focal game, but only if they also played against each other in another game in the same year and this sum should exclude the player for whom it is being calculated.

I get close using this code modified from how to cumulatively add values in one vector in R e.g for NPg1:

dtg1 <- dt[,.SD[(g1==1) & (g2==1 | g3==1 | g4==1)][, NPg1:= sum(g1)], by=year]

This subsets the dt on my conditions and creates the sum, however, the sum includes the focal players. For example NPg1 in year==2000 is 1 for Tom, but it should be 0 because even though he played in g1 he did not play another player in another game in that year. Once I get the sum right, I can then do this for each game and combine the results back into a data.table. The main question is, how can I get the correct sum with these conditions.

The result for NPg1 should look like this

dtg1$NPg1result <- c(0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3)

Any ideas would be greatly appreciated.

After @Mike.Gahan's comment:

This is the sub-result for g1, maybe this does not become very clear form my post. Once I have that correctly I could easily join it back to the full data.table using:

library(plyr)
dt <- join(dt, dtg1)

or other merge/join operations but since my question is mainly concerned with the sub-result I did not want to bother everyone with the rest.

EDIT after @ Ricardo Saportas solution

The full desired result with all the games looks as follows:

dtresult <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003), 
                   name = c("Ann", "Ann", "Ann", "Ann", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Tom", "Tom", "Tom", "Tom"), 
                   NPg1 = c(0, 1, 3, 3, 0, 0, 3, 3, 0, 0, 3, 3, 0, 1, 3, 3), 
                   NPg2 = c(0, 0, 2, 3, 0, 0, 2, 3, 1, 0, 0, 3, 1, 0, 2, 3), 
                   NPg3 = c(0, 0, 3, 2, 0, 2, 3, 0, 1, 2, 3, 2, 1, 2, 3, 2), 
                   NPg4 = c(0, 0, 2, 2, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 2, 2))
Community
  • 1
  • 1
Rkook
  • 63
  • 1
  • 11
  • Are you sure that is the desired result? It only has 11 elements. Also, why doesn't Tom play Ann in 200? – Mike.Gahan Aug 27 '14 at 00:55
  • @Mike.Gahan, Tom does play Ann in `g1` but in no other game in 2000 because she did not participate in other games in 2000. The `g` variables are indicator variables for participation in the game (1 for participation, 0 for non participation), there might be other players not listed here that played in let's say `g2` in 2000 but non of the players here did while also playing in `g1`. So obviously the results for Tom would change if these players were included. I don't think it changes the logic though. I hope this clarifies. – Rkook Aug 27 '14 at 01:20

1 Answers1

2

One approach is to do a cartesian join on the year-g1-g2-..-gn combinations.

Then on the new table, you can "ignore the rows" [see note at bottom] that do not qualify -- namely, players playing against themselves, and those player-combinations that only played one game.

setkeyv(dt, c("year", games))
dt.merged <- merge(dt, dt, all=TRUE, allow.cartesian=TRUE, suffixes=c("", ".y"))
## ignore players playing against themselves
dt.merged[name != name.y, (games) := 0 ]
## ignore player combinations that only shared one game
dt.merged[ (rowSums(dt.merged[, games, with=FALSE]) <= 1) , (games) := 0 ]
## now just sum itup
results <- dt.merged[, lapply(.SD, sum), keyby=list(year, name), .SDcols=games]
## clean up the names
setnames(results, games, paste0("NP", games))

Which yields

results

    year name g1 g2 g3 g4
 1: 2000  Ann  0  0  0  0
 2: 2000 Fred  0  0  0  0
 3: 2000 Gill  0  1  1  1
 4: 2000  Tom  1  1  1  0
 5: 2001  Ann  1  1  0  0
 6: 2001 Fred  0  0  1  1
 7: 2001 Gill  0  0  1  1
 8: 2001  Tom  1  0  1  0
 9: 2002  Ann  1  1  1  1
10: 2002 Fred  1  1  1  0
11: 2002 Gill  1  0  1  1
12: 2002  Tom  1  1  1  1
13: 2003  Ann  1  1  1  1
14: 2003 Fred  1  1  0  0
15: 2003 Gill  1  1  1  1
16: 2003  Tom  1  1  1  1

Note that you have two options to "ignore the row"

If you want to preserve the "0" count for the year-player, then use

dt.merged[ <filter>,  (games) := 0 ]

If you do not care for the "0" count for the year-player, then use

dt.merged <- dt.merged[ ! <filter> ]
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • Thanks a lot for the quick reply. I define `games <- c("g1", "g2", "g3", "g4")` and get the same results. If I understand this correctly, this gets rid of all the indicator variables which should not be considered in the final summing e.g. for Ann in `g1` in 2000. So that now I can use my conditions on these results like: `results[,.SD[(NPg1==1) & (NPg2==1 | NPg3==1 | NPg4==1)][, Totg1:= sum(NPg1)-1], by=year]` to get the number of players the focal player faces that he/she also faced in another game in a given year, and do this for each game. Or am I missing something. – Rkook Aug 27 '14 at 07:49
  • Hi @Rkook, nope. I think you misunderstood what is going on here. The `games` variable is simply used to not have to keep listing them over and over again. The actual "trick" is with the `cartesian join` and by comparing the two different `name / name.y` columns. Have a look at `DT.merged` when it is first created. – Ricardo Saporta Aug 27 '14 at 19:38
  • Thanks for the clarification @ Ricardo Saporta, but it leaves me a bit confused because the `results` seems to be the number of players with whom a a player played at least more than one game in a given year (e.g. for Ann 2000 `g1` is now 0) but I wanted to get the total number of players a player played in a given year if he/she also played that players in another game in that year. So for example for Tom in 2003 that number should be 3 in the `NPg1` variable because he played 3 other players in both `g1` and game other than `g1` in 2003. I have include a full desired result for clarity. – Rkook Aug 27 '14 at 23:09