I've got a data.table
that looks like this:
DT <- data.table(Feature1 = c("yes", "yes", "yes", "no", "no"),
Feature2 = c("yes", "yes", "yes", "yes", "no"),
Feature3 = c("yes", "yes", "yes", "yes", "no"),
Var1 = c("yes", "yes", "no", "yes", "no"),
Var2 = c("yes", "yes", "yes", "yes", "yes"))
DT
## Feature1 Feature2 Feature3 Var1 Var2
##1: no no no no yes
##2: no yes yes yes yes
##3: yes yes yes yes yes
##4: yes yes yes yes yes
##5: yes yes yes no yes
Now I'd like to count the occurrence and the proportion of "Var1" being "yes" for all possible combinations of the features, "Var2" being "yes" by these combinations etc. I need a count as well as the proportion of "yes"-answers again by each combination.
Getting count for one variable is easy. As I don't want to drop any combinations, I use CJ
rather than by
:
DT[,`:=`(Feature1 = as.factor(Feature1),
Feature2 = as.factor(Feature2),
Feature3 = as.factor(Feature3))]
(Btw, is there a nicer way to set a number of columns as factors at once?)
setkeyv(DT, c("Feature1", "Feature2", "Feature3", "Var1"))
DT2 <- DT[CJ(levels(Feature1), levels(Feature2), levels(Feature3), "yes"),
list(Var1.count = .N)]
DT2[, Var1 := NULL]
However, using CJ
means that I have to set a new key for each variable. What if I have 100 of them? Is there a nicer way to do this rather than setting up a for
-loop? Also, how do I get the proportions out of here? E.g., for the combination of features "yes, yes, yes", Var1 is "yes" twice and "no" once, so I'd like to get another column called Var1.prop
with the value 0.66 in the corresponding row.
In essence, this is what I aim for:
Feature1 Feature2 Feature3 Var1 Var1.count Var1.prop Var2.count Var2.prop
1: no no no yes 0 NA 1 1.00
2: no no yes yes 0 NA 0 NA
3: no yes no yes 0 NA 0 NA
4: no yes yes yes 1 1.00 1 1.00
5: yes no no yes 0 NA 0 NA
6: yes no yes yes 0 NA 0 NA
7: yes yes no yes 0 NA 0 NA
8: yes yes yes yes 2 0.66 3 1.00
The solution should be scalable for a high number of varying features and variables. I prefer using data.table
because it's much faster than normal data.frame
operations and because I found it to be easier to use in functions compared to dplyr
. Having said that, I would also accept a neat and not too inefficient solution with data.frame
.
Update after @Arun's answer. That's really neat, but it isn't well extensible to, let's say, 100 variables. I've been trying to build upon Arun's answer this way, but it only returns an empty data.table
along with a warning:
vars <- c("Var1", "Var2")
tmps <- paste0(vars, ".tmp")
ans <- DTn[, { for (var in vars){
assign(paste0(var, ".tmp"), sum(var == "yes", na.rm = TRUE));
list(assign(paste0(var, ".count"), get(paste0(var, ".tmp"))),
assign(paste0(var, ".prop"), get(paste0(var, ".tmp"))/.N)
)
}}, by = key(DT), with = FALSE]
What's going wrong here?