Summarize a data.table across multiple columns

Question

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here the x.name is the most common x for the user and x.support is the share events where this top x was observed)

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5 and I want to avoid having to repeat the same code for each field separately, and have to edit this function if I ever modify fields. Please note that this is the substance of this question, the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].

@RicardoSaporta: I added a few words on what `support` is, but you have to look at the referenced question for more details. — sds, Apr 26 '13 at 17:29
You can substitute `dt[, sum(behavior) > 0, by=user]` with `dt[,any(behavior),by=user]`. You should also explain, why you want to apply to multiple columns programmatically. Either you have just a few columns, then it's probably not worth bothering. Or you have many columns, then it might be better to reshape your data.table to long format. — Roland, Apr 26 '13 at 18:03
@Roland: I do not want to have to edit this function if I ever modify `fields`. — sds, Apr 26 '13 at 18:28
Sam, you are here asking for help and you are being obnoxious and rude. I wonder if that's the best way to elicit others to spend their time on your troubles. — Ricardo Saporta, Apr 26 '13 at 19:27
Sam, you've gotten 4 downvotes on an intelligent and interesting question. If you are seriously unaware, you can google etiquette. — Ricardo Saporta, Apr 28 '13 at 04:16
@RicardoSaporta: if 4 persons are offended, then, obviously, I said something bad. I just don't see what or where. — sds, Apr 28 '13 at 04:57
@Arun: no, English is not my native tongue. I guess you are right, those words were rude. Sorry. — sds, May 03 '13 at 12:49

score 5 · Accepted Answer · answered Apr 26 '13 at 23:40

Does this solve your problem?

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
           behavior=c(rep(FALSE,5),rep(TRUE,5)),
           country=c(rep(1,4),rep(2,6)),
           language=c(rep(6,6),rep(5,4)),
           event=1:10, key=c("user",fields))

CalculateSupport <- function(dt, name) {
  x <- dt[, .N, by = eval(paste0('user,', name))]
  setnames(x, name, 'name')
  x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c('V1', 'V2'), paste0(name, c(".name", ".support")))
  x
}

users <- dt[, sum(behavior) > 0, by=user] 
setnames(users, "V1", "behavior")

Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8

P.S. Please please take Ricardo's comment to your question seriously. SO is full of wonderful people who are willing to help but you have to treat them nicely and with respect.

thanks for the answer. I never intend to be rude and I don't see anything rude in my questions — sds, Apr 28 '13 at 02:25

score 1 · Answer 2 · answered Apr 26 '13 at 18:17

I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.

#Find most common country and language for each user
summ.dt<-dt[,list(behavior.summ=sum(behavior)>0,
     country.name=dt[user==.BY[[1]],.N,by=country][N==max(N),country],
     language.name=dt[user==.BY[[1]],.N,by=language][N==max(N),language]),
by=user]

#Get support for each country and language for each user
summ.dt[,c("country.support","language.support"):=list(
     nrow(dt[user==.BY[[1]] & country==country.name])/nrow(dt[user==.BY[[1]]]),
     nrow(dt[user==.BY[[1]] & language==language.name])/nrow(dt[user==.BY[[1]]])
),by=user]

    user behavior.summ country.name language.name country.support language.support
1:    3         FALSE            1             6             0.8              1.0
2:    4          TRUE            2             5             1.0              0.8

the whole point of my question is to avoid a separate expression for each element of the `fields` vector — sds, Apr 26 '13 at 18:21

Summarize a data.table across multiple columns

2 Answers2

Linked