0

I have a data frame with many columns, some of which are measure variables. I would like to extract a bunch of summary statistics from the latter using data.table. My problem is the following: how to rename the aggregated columns according to the function that was used?

I want to have an aggregated data.table with column names like: c("measure1_mean", "measure1_sd", "measure2_mean", "measure2_sd", ...)

My code looks like this:

library(data.table)
library(stringr)

dt <- data.table(meas1=1:10,
                 meas2=seq(5,25, length.out = 10),
                 meas3=rnorm(10),
                 groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
                 lapply(.SD, sd)),
               by=groupvar, .SDcols = measure_cols]

# Does not work because of duplicates in rep(measure_cols, 3)
agg_names <- c(measure_cols, paste(rep(c("mean", "sd"), each=length(measure_cols)), measure_cols, sep="_"))
setnames(dt_agg, rep(measure_cols,3), agg_names)

This chunk effectively extracts the statistics but returns columns with identical names. Therefore I cannot use something like setnames(dt, old, new) because duplicates exist in my 'old' vector.

I came across this post: Rename aggregated columns using data.table in R. But I do not like the accepted solution because it relies on column index, and not names, to rename the columns.

majpark
  • 83
  • 9
  • 3
    Please show a small reproducible example – akrun Nov 28 '18 at 09:29
  • 1
    I think these may get you going: [Calculate multiple aggregations with lapply(.SD, …)](https://stackoverflow.com/a/24151832/1851712) and [Apply multiple functions to multiple columns](https://stackoverflow.com/a/29621821/1851712) – Henrik Nov 28 '18 at 09:41
  • as per first link given by @Henrik , `dt[, as.list(unlist(lapply(.SD, function(meas) list(mean=mean(meas), sd=sd(meas))))), by=groupvar, .SDcols = measure_cols]` gives you directly what you're after – Cath Nov 28 '18 at 10:00
  • @Henrik thank you for pointing me to relevant posts. I find the syntax rather counter-intuitive but it works like a charm! – majpark Nov 28 '18 at 14:38

1 Answers1

0
library(data.table)

dt <- data.table(meas1=1:10,
                 meas2=seq(5,25, length.out = 10),
                 meas3=rnorm(10),
                 groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
                 lapply(.SD, sd)),
             by=groupvar, .SDcols = measure_cols]

you can create a vector with names... use the each argument to paste the function,names behind the measure_cols.

function.names <- c("mean", "sd")
column.names <- paste0( measure_cols, "_", rep( function.names, each = length( measure_cols ) ) )
setnames( dt_agg, c("groupvar", column.names ))

#    groupvar meas1_mean meas2_mean meas3_mean  meas1_sd meas2_sd  meas3_sd
# 1:        A        1.5   6.111111  0.2346044 0.7071068 1.571348 1.6733804
# 2:        B        3.5  10.555556  0.5144621 0.7071068 1.571348 0.0894364
# 3:        C        5.5  15.000000 -0.5469839 0.7071068 1.571348 2.1689620
# 4:        D        7.5  19.444444 -0.3898213 0.7071068 1.571348 1.0007027
# 5:        E        9.5  23.888889  0.5569743 0.7071068 1.571348 1.4499413
Wimpel
  • 26,031
  • 1
  • 20
  • 37
  • Thanks for your answer! However this still relies on the assumption that your columns are ordered this way, which I would specifically like to avoid. – majpark Nov 28 '18 at 14:32
  • @majpark you determine the order of columns in the lines where you calculate the mean, sd, etc... so they will always line up with a vector created with the same order of functions... – Wimpel Nov 30 '18 at 12:08