Applying many functions to many variables (long format)

Question

I've read several related threads on how to apply many different functions to many different columns in data.tables. This one about columns and groups and a similar one here. They were both very helpful, but I am looking for a more elegant solution to something quite similar.

From the two above links, the following code:

library(data.table)

DT <- data.table(x= rnorm(50), y = rnorm(50), treatment = c(0,1))
vars <- c("x", "y")
  
my.summary = function(x) c(Mean = mean(x, na.rm = T), Min = min(x, na.rm = T), Q1 = quantile(x, 0.25, na.rm =T), 
                             Median = median(x, na.rm = T), Q3 = quantile(x, 0.75, na.rm=T), Max = max(x, na.rm = T)) 
summ_stats <- DT[, as.list(unlist(lapply(.SD, my.summary))), .SDcols = vars, by = .(treatment)]

produces summary statistics for the variables in var by treatment status. The output looks something like:

   x.Mean, y Min, y Q1.25%, ..., x.Max, y.Mean, y.Min, ...., y.Max
x
y

I am looking for something similar, but what I want exactly (using data.table's speed) is something that looks like:

variable treatment  Max Min Q1 Median Q3, Max, p.value 
x           0
            1
y           0
            1

Any suggestions would be greatly appreciated!

None of base (`data.frame`), dplyr (`tbl_df`), or `data.table` show tables like that. What you're asking for is not so much about data processing (for which `dplyr` and `data.table` are designed), it is about report rendering. I discourage modifying *data* to look like that if it's not for a report, because anything done to it later may need to *infer* what an empty value should really be, and while that seems innocuous with you controlling how it is put together, it's simple until it fails. — r2evans, Mar 19 '21 at 20:32
Thanks for your comment! It would be for a report where I throw the above modified data.table into a function like kable() in my R markdown notebook. Appreciate your comment, however! — plausibly_exogenous, Mar 19 '21 at 20:37

akrun · Accepted Answer · 2021-03-19T20:53:03.193

From the OP's output, we can use melt to reshape to 'long' format

library(data.table)
out <- melt(summ_stats, id.vars = 'treatment', measure = 
 patterns("Mean", "Min", "Q1.25%", "Median", "Q3.75%", "Max"), 
  value.name = c("Mean", "Min", "Q1.25%", "Median", "Q3.75%", "Max") )[, 
         variable := c("x", "y")[variable]][]

out[duplicated(variable), variable := ""][]
setcolorder(out, c("variable", setdiff(names(out), "variable")))
out
#   variable treatment        Mean       Min     Q1.25%      Median    Q3.75%      Max
#1:        x         0 -0.04316915 -1.624365 -0.5417604 -0.16117851 0.4965782 2.229262
#2:                  1  0.14239444 -2.904899 -0.7059286  0.27866472 1.2693872 1.696948
#3:        y         0  0.32307227 -1.648222 -0.4209979  0.49096737 1.3184009 2.507111
#4:                  1 -0.21832078 -1.890027 -0.6968235 -0.08252376 0.3694591 1.034514

Or another option is collap from collapse which can reshape to a more convenient format in a simplified code

library(collapse)
collap(DT, ~ treatment, list(fmean, fmin, fmedian, fmax), return = "long")
#   Function           x           y treatment
#1:    fmean -0.04316915  0.32307227         0
#2:    fmean  0.14239444 -0.21832078         1
#3:     fmin -1.62436453 -1.64822174         0
#4:     fmin -2.90489906 -1.89002714         1
#5:  fmedian -0.16117851  0.49096737         0
#6:  fmedian  0.27866472 -0.08252376         1
#7:     fmax  2.22926220  2.50711115         0
#8:     fmax  1.69694788  1.03451432         1

The fprefix functions are already optimized and would be very fast

Also, the descr can get the quantile per group along with other descriptive statistics

descr(DT,  g = DT$treatment)

Applying many functions to many variables (long format)

1 Answers1