1

I've read several related threads on how to apply many different functions to many different columns in data.tables. This one about columns and groups and a similar one here. They were both very helpful, but I am looking for a more elegant solution to something quite similar.

From the two above links, the following code:

library(data.table)

DT <- data.table(x= rnorm(50), y = rnorm(50), treatment = c(0,1))
vars <- c("x", "y")
  
my.summary = function(x) c(Mean = mean(x, na.rm = T), Min = min(x, na.rm = T), Q1 = quantile(x, 0.25, na.rm =T), 
                             Median = median(x, na.rm = T), Q3 = quantile(x, 0.75, na.rm=T), Max = max(x, na.rm = T)) 
summ_stats <- DT[, as.list(unlist(lapply(.SD, my.summary))), .SDcols = vars, by = .(treatment)]

produces summary statistics for the variables in var by treatment status. The output looks something like:

   x.Mean, y Min, y Q1.25%, ..., x.Max, y.Mean, y.Min, ...., y.Max
x
y

I am looking for something similar, but what I want exactly (using data.table's speed) is something that looks like:

variable treatment  Max Min Q1 Median Q3, Max, p.value 
x           0
            1
y           0
            1

Any suggestions would be greatly appreciated!

  • 1
    None of base (`data.frame`), dplyr (`tbl_df`), or `data.table` show tables like that. What you're asking for is not so much about data processing (for which `dplyr` and `data.table` are designed), it is about report rendering. I discourage modifying *data* to look like that if it's not for a report, because anything done to it later may need to *infer* what an empty value should really be, and while that seems innocuous with you controlling how it is put together, it's simple until it fails. – r2evans Mar 19 '21 at 20:32
  • Thanks for your comment! It would be for a report where I throw the above modified data.table into a function like kable() in my R markdown notebook. Appreciate your comment, however! – plausibly_exogenous Mar 19 '21 at 20:37

1 Answers1

4

From the OP's output, we can use melt to reshape to 'long' format

library(data.table)
out <- melt(summ_stats, id.vars = 'treatment', measure = 
 patterns("Mean", "Min", "Q1.25%", "Median", "Q3.75%", "Max"), 
  value.name = c("Mean", "Min", "Q1.25%", "Median", "Q3.75%", "Max") )[, 
         variable := c("x", "y")[variable]][]

out[duplicated(variable), variable := ""][]
setcolorder(out, c("variable", setdiff(names(out), "variable")))
out
#   variable treatment        Mean       Min     Q1.25%      Median    Q3.75%      Max
#1:        x         0 -0.04316915 -1.624365 -0.5417604 -0.16117851 0.4965782 2.229262
#2:                  1  0.14239444 -2.904899 -0.7059286  0.27866472 1.2693872 1.696948
#3:        y         0  0.32307227 -1.648222 -0.4209979  0.49096737 1.3184009 2.507111
#4:                  1 -0.21832078 -1.890027 -0.6968235 -0.08252376 0.3694591 1.034514

Or another option is collap from collapse which can reshape to a more convenient format in a simplified code

library(collapse)
collap(DT, ~ treatment, list(fmean, fmin, fmedian, fmax), return = "long")
#   Function           x           y treatment
#1:    fmean -0.04316915  0.32307227         0
#2:    fmean  0.14239444 -0.21832078         1
#3:     fmin -1.62436453 -1.64822174         0
#4:     fmin -2.90489906 -1.89002714         1
#5:  fmedian -0.16117851  0.49096737         0
#6:  fmedian  0.27866472 -0.08252376         1
#7:     fmax  2.22926220  2.50711115         0
#8:     fmax  1.69694788  1.03451432         1

The fprefix functions are already optimized and would be very fast

Also, the descr can get the quantile per group along with other descriptive statistics

descr(DT,  g = DT$treatment) 
akrun
  • 874,273
  • 37
  • 540
  • 662