1

I'd like to apply several functions to several columns of a data.table and generate new columns based on the output. I have found similar questions here, but the answers provided did not seem to address my exact issue, e.g.:

Apply multiple functions to multiple columns in data.table

ddply to multiple columns equivalent in data.table

R data.table - Apply function A to some columns and function B to some others

Generate some data:

set.seed(1)
p <- rep(seq(1:10),4)
p

time1 <- sample(1:40, 40, replace=TRUE)
time2 <- sample(1:40, 40, replace=TRUE)
contact1 <- sample(rep(c("personal", "nonpersonal"),20), 40)
contact2 <- sample(rep(c("personal", "nonpersonal"),20), 40)
closeness1 <- sample(1:10, 40, replace=TRUE)
closeness2 <- sample(1:10, 40, replace=TRUE)

dt <- data.table::data.table(p, time1, time2, contact1, contact2, closeness1, closeness2)

This works, but seems inefficient as I am running this for each column separately:

# s1
dt[, c("scliq.s", "symgr.s") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time1", by = p]

# d1
dt[, c("scliq.d", "symgr.d") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time2", by = p]

# s2
dt[, c("pers.s", "npers.s") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact1", by = p]

# d2
dt[, c("pers.d", "npers.d") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact2", by = p]

I have tried modifying similar solutions from other posts. For the sake of simplicity I have tried this for just # s1 and # d1, but would eventually like to do # s1, # d1, # s2 and # d2 all in one go. I am not stuck on length(which) and just need to count the number of instances in each case (table() is also fine, but I couldn't get data.table to save the right output from table()):

# option 1
my.summary = function(x) list(count1 = length(which(x<=7)), count2 = length(which(x>7 & x<=31)))

dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") :=unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]

# option 2, note: I wasn't sure how to adapt sum/mean to a nested function call (i.e., length(which))
dt$dday <- 1 # add a constant column
dt <- dcast(dt, dday~dday, fun=list(sum, mean), value.var = c("time1", "time2"))

I succeed in generating the required number of columns. However, all four columns contain the same value in each row even though it may not be the same, as can be seen in the output of the following code snippet:

dt[, unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]

A second bit that I'd like to do is to compute the means of closeness1 and 2 based on the above criteria for columns time1 and time2 (again for each value of p separately, i.e. by = p) and to save the output in a new column each using the format "scliq" / "symgr", as above. For example, I want to compute the means of closeness1 for all scores in time1 at or below 7 and also for all scores in time1 between 8 and 31 (likewise for closeness2 and time2).

I should also note that I know how to resolve this problem using the tidyverse packages, but would be keen to learn how to do this in data.table for concision and efficiency. Any prompts or, indeed, solutions would be extremely appreciated.

Tiberius
  • 331
  • 1
  • 9
  • 1
    Some advise: (1) if you use just one column in `.SD`, you could also use it directly (without the quote though); (2) instead of using `lenght` and `which` on the conditions, it is better to use `sum`. E.g.: `sum(time1 <= 7)`. – Jaap Jul 29 '19 at 17:10

1 Answers1

1

The reason your solution with my.summary is not working is that unlist is recursive by default, so it ends up packing all values from all nested lists in a single vector, and data.table ends up recycling values silently. Taking into account Jaap's comment, you can write:

my.summary = function(x) list(sum(x<=7), sum(x>7 & x<=31))

dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") := unlist(lapply(.SD, my.summary), recursive = FALSE),
   .SDcols = c("time1", "time2"), by = p]

For the means, I can think of 2 options, the first one uses .SD and by, which can be slow at times:

dt[, c("mean1", "mean2") := .(.SD[time1 <= 7, mean(closeness1)], 
                              .SD[time2 > 7 & time2 <= 31, mean(closeness2)]),
   by = p,
   .SDcols = time1:closeness2]

The other option is to calculate the means in a sub-table and then join back:

dt[dt[time1 <= 7, .(ans = mean(closeness1)), by = p], mean1 := ans, on = "p"]
dt[dt[time2 > 7 & time2 <= 31, .(ans = mean(closeness2)), by = p], mean2 := ans, on = "p"]

Depending on your actual data, one might be faster than the other, so you should time them.

Alexis
  • 4,950
  • 1
  • 18
  • 37
  • Thanks a ton. This nicely works. Do you know how I might go about the second bit, i.e. running a function on column y (e.g. mean), based on criteria in column x (e.g. anyone with a score below 7)? – Tiberius Jul 30 '19 at 09:14
  • 1
    @Tiberius I have a couple of ideas. I've updated the answer. – Alexis Jul 30 '19 at 10:09
  • Brilliant, works like a charm. I ended up doing something like this: ```dt[, c("mean1", "mean2", "mean3", "mean4") := .( .SD[time1 <= 7, mean(closeness1)], .SD[time1 > 7 & time1 <= 31, mean(closeness1)], .SD[time2 <= 7, mean(closeness2)], .SD[time2 > 7 & time2 <= 31, mean(closeness2)]), by = p, .SDcols = c("time1", "time2", "closeness1","closeness2")]``` – Tiberius Jul 30 '19 at 22:25