I'd like to apply several functions to several columns of a data.table and generate new columns based on the output. I have found similar questions here, but the answers provided did not seem to address my exact issue, e.g.:
Apply multiple functions to multiple columns in data.table
ddply to multiple columns equivalent in data.table
R data.table - Apply function A to some columns and function B to some others
Generate some data:
set.seed(1)
p <- rep(seq(1:10),4)
p
time1 <- sample(1:40, 40, replace=TRUE)
time2 <- sample(1:40, 40, replace=TRUE)
contact1 <- sample(rep(c("personal", "nonpersonal"),20), 40)
contact2 <- sample(rep(c("personal", "nonpersonal"),20), 40)
closeness1 <- sample(1:10, 40, replace=TRUE)
closeness2 <- sample(1:10, 40, replace=TRUE)
dt <- data.table::data.table(p, time1, time2, contact1, contact2, closeness1, closeness2)
This works, but seems inefficient as I am running this for each column separately:
# s1
dt[, c("scliq.s", "symgr.s") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time1", by = p]
# d1
dt[, c("scliq.d", "symgr.d") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time2", by = p]
# s2
dt[, c("pers.s", "npers.s") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact1", by = p]
# d2
dt[, c("pers.d", "npers.d") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact2", by = p]
I have tried modifying similar solutions from other posts. For the sake of simplicity I have tried this for just # s1
and # d1
, but would eventually like to do # s1
, # d1
, # s2
and # d2
all in one go. I am not stuck on length(which)
and just need to count the number of instances in each case (table()
is also fine, but I couldn't get data.table
to save the right output from table()
):
# option 1
my.summary = function(x) list(count1 = length(which(x<=7)), count2 = length(which(x>7 & x<=31)))
dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") :=unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]
# option 2, note: I wasn't sure how to adapt sum/mean to a nested function call (i.e., length(which))
dt$dday <- 1 # add a constant column
dt <- dcast(dt, dday~dday, fun=list(sum, mean), value.var = c("time1", "time2"))
I succeed in generating the required number of columns. However, all four columns contain the same value in each row even though it may not be the same, as can be seen in the output of the following code snippet:
dt[, unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]
A second bit that I'd like to do is to compute the means of closeness1 and 2 based on the above criteria for columns time1 and time2 (again for each value of p separately, i.e. by = p
) and to save the output in a new column each using the format "scliq" / "symgr", as above. For example, I want to compute the means of closeness1 for all scores in time1 at or below 7 and also for all scores in time1 between 8 and 31 (likewise for closeness2 and time2).
I should also note that I know how to resolve this problem using the tidyverse packages, but would be keen to learn how to do this in data.table
for concision and efficiency. Any prompts or, indeed, solutions would be extremely appreciated.