
I've got a data.table that looks like this:

DT <- data.table(Feature1 = c("yes", "yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "yes", "no"),
                 Feature3 = c("yes", "yes", "yes", "yes", "no"),
                 Var1 = c("yes", "yes", "no", "yes", "no"),
                 Var2 = c("yes", "yes", "yes", "yes", "yes"))


DT

##   Feature1 Feature2 Feature3 Var1 Var2
##1:      yes      yes      yes  yes  yes
##2:      yes      yes      yes  yes  yes
##3:      yes      yes      yes   no  yes
##4:       no      yes      yes  yes  yes
##5:       no       no       no   no  yes

Now, for every possible combination of the features, I'd like to count how often "Var1" is "yes", how often "Var2" is "yes", and so on. For each combination I need both the count and the proportion of "yes" answers.

Getting the count for one variable is easy. Since I don't want to drop any combinations, I use CJ rather than a plain by:

DT[,`:=`(Feature1 = as.factor(Feature1),
         Feature2 = as.factor(Feature2),
         Feature3 = as.factor(Feature3))]

(Btw, is there a nicer way to set a number of columns as factors at once?)

setkeyv(DT, c("Feature1", "Feature2", "Feature3", "Var1"))
DT2 <- DT[CJ(levels(DT$Feature1), levels(DT$Feature2), levels(DT$Feature3), "yes"),
          list(Var1.count = .N)]
DT2[, Var1 := NULL]

However, using CJ means that I have to set a new key for each variable. What if I have 100 of them? Is there a nicer way to do this rather than setting up a for-loop? Also, how do I get the proportions out of here? E.g., for the combination of features "yes, yes, yes", Var1 is "yes" twice and "no" once, so I'd like to get another column called Var1.prop with the value 0.66 in the corresponding row.

In essence, this is what I aim for:

   Feature1 Feature2 Feature3 Var1 Var1.count Var1.prop Var2.count Var2.prop
1:       no       no       no  yes          0        NA         1        1.00
2:       no       no      yes  yes          0        NA         0        NA
3:       no      yes       no  yes          0        NA         0        NA
4:       no      yes      yes  yes          1        1.00       1        1.00
5:      yes       no       no  yes          0        NA         0        NA
6:      yes       no      yes  yes          0        NA         0        NA
7:      yes      yes       no  yes          0        NA         0        NA
8:      yes      yes      yes  yes          2        0.66       3        1.00

The solution should be scalable for a high number of varying features and variables. I prefer using data.table because it's much faster than normal data.frame operations and because I found it to be easier to use in functions compared to dplyr. Having said that, I would also accept a neat and not too inefficient solution with data.frame.


Update after @Arun's answer. That's really neat, but it doesn't extend well to, say, 100 variables. I've tried to build on Arun's answer as follows, but it only returns an empty data.table along with a warning:

vars <- c("Var1", "Var2")
tmps <- paste0(vars, ".tmp")

ans <- DTn[, { for (var in vars){
  assign(paste0(var, ".tmp"), sum(var == "yes", na.rm = TRUE));
  list(assign(paste0(var, ".count"), get(paste0(var, ".tmp"))),
       assign(paste0(var, ".prop"), get(paste0(var, ".tmp"))/.N)
  )
}}, by = key(DT), with = FALSE]

What's going wrong here?

AnjaM

1 Answer


You don't have to convert the columns to factors. In fact, data.table recommends avoiding factors wherever possible, as character columns are also faster to work with. That said, here's a much easier way to convert a set of columns to factor, for future reference:

sd_cols = c("Feature1", "Feature2", "Feature3")
DT[, c(sd_cols) := lapply(.SD, as.factor), .SDcols=sd_cols]

Okay, now on to the solution. We'll need CJ here because you want the absent combinations as well, so we have to generate them first.

uvals = c("no", "yes")
setkey(DT, Feature1, Feature2, Feature3)
DTn = DT[CJ(uvals, uvals, uvals), allow.cartesian=TRUE]

The allow.cartesian=TRUE is necessary because the join will result in more rows than max(nrow(x), nrow(i)) in a join x[i]. Read this post for more on allow.cartesian.
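To see the size mismatch concretely, here's a small sketch (my own addition, reusing the question's data): CJ() produces 2^3 = 8 candidate groups, and the group yes/yes/yes matches three rows of DT, so the join returns 10 rows, which exceeds max(nrow(x), nrow(i)) = 8.

```r
library(data.table)

DT <- data.table(Feature1 = c("yes", "yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "yes", "no"),
                 Feature3 = c("yes", "yes", "yes", "yes", "no"),
                 Var1 = c("yes", "yes", "no", "yes", "no"),
                 Var2 = c("yes", "yes", "yes", "yes", "yes"))
setkey(DT, Feature1, Feature2, Feature3)

uvals <- c("no", "yes")
nrow(CJ(uvals, uvals, uvals))   # 8 combinations in i
# "yes/yes/yes" matches 3 rows, the two other observed combinations match
# 1 row each, and the 5 absent combinations each contribute one NA row,
# so the join has 3 + 1 + 1 + 5 = 10 > 8 rows:
nrow(DT[CJ(uvals, uvals, uvals), allow.cartesian = TRUE])  # 10
```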

Now that we have all the combinations, we can group and aggregate to obtain the results in the form you require.

ans = DTn[, { tmp1 = sum(Var1 == "yes", na.rm=TRUE);
              tmp2 = sum(Var2 == "yes", na.rm=TRUE);
              list(Var1.count = tmp1,
                   Var1.prop  = tmp1/.N,
                   Var2.count = tmp2,
                   Var2.prop  = tmp2/.N)
            }, by=key(DT)]

#    Feature1 Feature2 Feature3 Var1.count Var1.prop Var2.count Var2.prop
# 1:       no       no       no          0 0.0000000          1         1
# 2:       no       no      yes          0 0.0000000          0         0
# 3:       no      yes       no          0 0.0000000          0         0
# 4:       no      yes      yes          1 1.0000000          1         1
# 5:      yes       no       no          0 0.0000000          0         0
# 6:      yes       no      yes          0 0.0000000          0         0
# 7:      yes      yes       no          0 0.0000000          0         0
# 8:      yes      yes      yes          2 0.6666667          3         1

I think you can play around with this to get NA instead of 0 where a combination never occurs, if that's really important.
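For instance, one way to do that (my own sketch, not part of the original answer) is to overwrite the proportion with NA wherever the "yes" count for a group is zero, which is the rule implied by the desired output in the question:

```r
library(data.table)

DT <- data.table(Feature1 = c("yes", "yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "yes", "no"),
                 Feature3 = c("yes", "yes", "yes", "yes", "no"),
                 Var1 = c("yes", "yes", "no", "yes", "no"),
                 Var2 = c("yes", "yes", "yes", "yes", "yes"))
setkey(DT, Feature1, Feature2, Feature3)
uvals <- c("no", "yes")
DTn <- DT[CJ(uvals, uvals, uvals), allow.cartesian = TRUE]

ans <- DTn[, { tmp1 <- sum(Var1 == "yes", na.rm = TRUE)
               tmp2 <- sum(Var2 == "yes", na.rm = TRUE)
               list(Var1.count = tmp1, Var1.prop = tmp1 / .N,
                    Var2.count = tmp2, Var2.prop = tmp2 / .N) },
           by = key(DT)]

# Set the proportion to NA wherever the "yes" count is zero, matching
# the NA pattern in the question's desired output:
ans[Var1.count == 0, Var1.prop := NA_real_]
ans[Var2.count == 0, Var2.prop := NA_real_]
ans
```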


Following OP's question under comment + edit, after getting DTn:

vars = c("Var1", "Var2")
ans = DTn[, c(N=.N, lapply(.SD, function(x) sum(x=="yes", na.rm=TRUE))), 
               by=key(DTn), .SDcols=vars]
N = ans$N
ans[, N := NULL]
ans[, c(paste(vars, "prop", sep=".")) := .SD/N, .SDcols=vars]
setnames(ans, vars, paste(vars, "count", sep="."))

ans
#    Feature1 Feature2 Feature3 Var1.count Var2.count Var1.prop Var2.prop
# 1:       no       no       no          0          1 0.0000000         1
# 2:       no       no      yes          0          0 0.0000000         0
# 3:       no      yes       no          0          0 0.0000000         0
# 4:       no      yes      yes          1          1 1.0000000         1
# 5:      yes       no       no          0          0 0.0000000         0
# 6:      yes       no      yes          0          0 0.0000000         0
# 7:      yes      yes       no          0          0 0.0000000         0
# 8:      yes      yes      yes          2          3 0.6666667         1

How about this?
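Building on that, here is one possible way (my own sketch; the helper name `count_props` is made up) to package the scalable version into a function that takes the feature columns and the variable names as character vectors, assuming every feature only takes the values "no"/"yes" as in the question's data:

```r
library(data.table)

# Hypothetical helper (not from the original answer): `features` and `vars`
# are character vectors of column names in DT.
count_props <- function(DT, features, vars) {
  uvals <- c("no", "yes")
  setkeyv(DT, features)
  # All combinations of "no"/"yes" over the feature columns:
  combos <- do.call(CJ, rep(list(uvals), length(features)))
  DTn <- DT[combos, allow.cartesian = TRUE]
  # Group size plus the "yes" count for every variable at once:
  ans <- DTn[, c(N = .N, lapply(.SD, function(x) sum(x == "yes", na.rm = TRUE))),
             by = features, .SDcols = vars]
  N <- ans$N
  ans[, N := NULL]
  ans[, c(paste(vars, "prop", sep = ".")) := .SD / N, .SDcols = vars]
  setnames(ans, vars, paste(vars, "count", sep = "."))
  ans[]
}

# Usage with the question's data:
DT <- data.table(Feature1 = c("yes", "yes", "yes", "no", "no"),
                 Feature2 = c("yes", "yes", "yes", "yes", "no"),
                 Feature3 = c("yes", "yes", "yes", "yes", "no"),
                 Var1 = c("yes", "yes", "no", "yes", "no"),
                 Var2 = c("yes", "yes", "yes", "yes", "yes"))
res <- count_props(DT, c("Feature1", "Feature2", "Feature3"), c("Var1", "Var2"))
```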

Arun
  • Thanks, this is very helpful! However, I'd like to use the code with a large number of variables and features, so I want to avoid redundancy in typing as much as possible. Ideally, I would just provide a vector of variables and a loop would do the job. I've updated the OP with a (failed) attempt to extend your code and would appreciate further help. – AnjaM May 28 '14 at 16:53
  • Just another question: Why do you need to use `c()` for converting columns into factors? `sd_cols` is already a vector, so why converting it once again? I've just did a `log`-transformation onto some columns and my machine kept crashing without the `c()` due to a memory overflow (there weren't any error messages, it just crashed completely). With `c()`, it worked perfectly. Why that? – AnjaM May 30 '14 at 07:18
  • @AnjaM, you may not need the `c`, but you need the `()`, so that `LHS` is considered an expression and evaluated to obtain the values stored in the vector. Because `data.table` also allows `DT[, col := value]`, where `col` is just a column name. – Arun May 30 '14 at 07:55
  • Ok, thanks for the explanation. Wouldn't it be reasonable for `data.table` either to automatically detect that `sd_cols` is a vector and thus has to be treated as an expression? Or at least issue a meaningful error when a vector rather than a single string is used at this place instead of letting the system crash? – AnjaM May 30 '14 at 09:14
  • What if you happen to have a variable, say `var`, that's a character vector, but what you intend to create a new column `var` and therefore did `DT[, var := value]`? How'll `data.table` know what to do with it? It's not an error either. I think it's probably useful to document this. – Arun May 30 '14 at 09:20
  • To be honest, I don't really get the problem, but maybe that's because I'm new to the concept of `data.table`. Would passing `with = FALSE` make the parentheses obsolete? Also, at the moment, if I pass just text, `data.table` already checks if the column exists, and if it doesn't, a new column is created. Why can't the same happen with a character vector? Besides that, you say this is not an error, but why does it lead to a crash of the OS rather than stopping with the message `data.table doesn't know how to deal with a vector that is not passed as an expression` or something similar? – AnjaM May 30 '14 at 09:40
  • I really don't know how else to explain. `vars = c("a", "b", "c"); DT[, vars := "bla"]`. Here, I've a variable `vars` defined, but what if I *don't* want to create columns `a,b,c`, but create a column named `vars`. It *shouldn't* evaluate by default and it's not an error as this is perfectly valid. If you feel strongly about this, please start a discussion on the mailing list with a reference to this post. – Arun May 30 '14 at 09:53
  • Ok, now I got what you mean. When I type `DT[, sd_cols := lapply(.SD, log), .SDcols=sd_cols]`, it tries to create a new column called `sd_cols` but gets a list of a couple of vectors with length = number of rows of the whole table. It then tries to assign to each entry of the new column one list element (huge vector), but that whole thing gets too large and the system crashes. Thanks a lot for your explanation and your patience! – AnjaM May 30 '14 at 10:22