3

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here the x.name is the most common x for the user and x.support is the share events where this top x was observed)

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5 and I want to avoid having to repeat the same code for each field separately, and have to edit this function if I ever modify fields. Please note that this is the substance of this question, the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].

Community
  • 1
  • 1
sds
  • 58,617
  • 29
  • 161
  • 278
  • could you explain the caluclation of `x.support` – Ricardo Saporta Apr 26 '13 at 17:24
  • @RicardoSaporta: I added a few words on what `support` is, but you have to look at the referenced question for more details. – sds Apr 26 '13 at 17:29
  • 1
    You can substitute `dt[, sum(behavior) > 0, by=user]` with `dt[,any(behavior),by=user]`. You should also explain, why you want to apply to multiple columns programmatically. Either you have just a few columns, then it's probably not worth bothering. Or you have many columns, then it might be better to reshape your data.table to long format. – Roland Apr 26 '13 at 18:03
  • @Roland: I do not want to have to edit this function if I ever modify `fields`. – sds Apr 26 '13 at 18:28
  • 4
    Sam, you are here asking for help and you are being obnoxious and rude. I wonder if that's the best way to elicit others to spend their time on your troubles. – Ricardo Saporta Apr 26 '13 at 19:27
  • @RicardoSaporta: what specifically is "obnoxious and rude"? – sds Apr 28 '13 at 02:24
  • Sam, you've gotten 4 downvotes on an intelligent and interesting question. If you are seriously unaware, you can google etiquette. – Ricardo Saporta Apr 28 '13 at 04:16
  • @RicardoSaporta: if 4 persons are offended, then, obviously, I said something bad. I just don't see what or where. – sds Apr 28 '13 at 04:57
  • 2
    @Arun: no, English is not my native tongue. I guess you are right, those words were rude. Sorry. – sds May 03 '13 at 12:49

2 Answers2

5

Does this solve your problem?

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
           behavior=c(rep(FALSE,5),rep(TRUE,5)),
           country=c(rep(1,4),rep(2,6)),
           language=c(rep(6,6),rep(5,4)),
           event=1:10, key=c("user",fields))

CalculateSupport <- function(dt, name) {
  x <- dt[, .N, by = eval(paste0('user,', name))]
  setnames(x, name, 'name')
  x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c('V1', 'V2'), paste0(name, c(".name", ".support")))
  x
}

users <- dt[, sum(behavior) > 0, by=user] 
setnames(users, "V1", "behavior")

Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8

P.S. Please please take Ricardo's comment to your question seriously. SO is full of wonderful people who are willing to help but you have to treat them nicely and with respect.

Victor K.
  • 4,054
  • 3
  • 25
  • 38
  • thanks for the answer. I never intend to be rude and I don't see anything rude in my questions – sds Apr 28 '13 at 02:25
1

I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.

#Find most common country and language for each user
summ.dt<-dt[,list(behavior.summ=sum(behavior)>0,
     country.name=dt[user==.BY[[1]],.N,by=country][N==max(N),country],
     language.name=dt[user==.BY[[1]],.N,by=language][N==max(N),language]),
by=user]

#Get support for each country and language for each user
summ.dt[,c("country.support","language.support"):=list(
     nrow(dt[user==.BY[[1]] & country==country.name])/nrow(dt[user==.BY[[1]]]),
     nrow(dt[user==.BY[[1]] & language==language.name])/nrow(dt[user==.BY[[1]]])
),by=user]

    user behavior.summ country.name language.name country.support language.support
1:    3         FALSE            1             6             0.8              1.0
2:    4          TRUE            2             5             1.0              0.8
Matt
  • 537
  • 5
  • 19
  • the whole point of my question is to avoid a separate expression for each element of the `fields` vector – sds Apr 26 '13 at 18:21