6

I want to summarise several variables in data.table, output in wide format, output possibly as a list per variable. Since several other approaches did not work, I tried to do an outer lapply, giving the names of the variables as character vectors. I wanted to pass these in, using with=FALSE.

carsx=as.data.table(cars)
lapply( list(speed="speed",dist= "dist"), #error object 'ansvals' not found
    function(x)  carsx[,list(mean(x), min(x), max(x) ), with=FALSE ] ) 

Since this does not work, I tried the more simple approach without lapply.

carsx[,list(mean("speed"), min("speed"), max("speed") ), with=FALSE ] #error object 'ansvals' not found

This does not work either. Is there any way to do something like this? Is this behaviour of 'with' wanted? (I am aware that ?data.table mentions with only to select columns, but in my case it would be useful to be able to transform them as well)

When with=FALSE, j is a vector of names or positions to select, similar to a data.frame. with=FALSE is often useful in data.table to select columns dynamically.

EDIT My aim is to get a summary per group in wide format, for different variables. I tried to extend the following, which works only for one variable, for a list of variables.

carsx[,list(mean(speed), min(speed), max(speed) ) ,by=(dist>50)

Lamentably SO doesnt let me post my other question. There I described that I want an output similiar to:

lapply( list(speed="speed",dist= "dist"),
        function(x) do.call("as.data.frame", aggregate(cars[,x], list(class=cars$dist>50), FUN=summary) ) )

Expected Output would be something like:

$speed 
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25

$dist
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25
Jaap
  • 81,064
  • 34
  • 182
  • 193
Julian
  • 741
  • 8
  • 19
  • I want to get summary statistics in a wide format, possibly in a list per (grouped) variable. For that I tried the outer lapply, and therefore Svens' answer is also not the right one for me... – Julian Nov 10 '14 at 13:13
  • I group by (dist>50) – Julian Nov 10 '14 at 13:35

2 Answers2

5

You can specify the columns with the .SDcols parameter:

carsx[ , lapply(.SD, function(x) c(mean(x), min(x), max(x))), 
      .SDcols = c("speed", "dist")]
#    speed   dist
# 1:  15.4  42.98
# 2:   4.0   2.00
# 3:  25.0 120.00

carsx[ , lapply(.SD, function(x) c(mean(x), min(x), max(x))), 
      .SDcols = "speed"]
#    speed
# 1:  15.4
# 2:   4.0
# 3:  25.0
Sven Hohenstein
  • 80,497
  • 17
  • 145
  • 168
1

Building on Svens answer a combination of .SDcols, rbindlist, and outer and inner lapply did the trick. The inner lapply is necessary to access .SD.

lapply( list(speed="speed",dist= "dist"),
    function(x)  carsx[ , rbindlist(lapply(.SD, function(x) list(mean=mean(x), min=min(x), max=max(x)) )), 
                       .SDcols = x,by= (dist>50)] ) 

Result:

$speed
    dist     mean min max
1: FALSE 12.96970   4  20
2:  TRUE 20.11765  14  25

$dist
    dist     mean min max
1: FALSE 27.84848   2  50
2:  TRUE 72.35294  52 120
Julian
  • 741
  • 8
  • 19