If I pass the variable bloodpressure to data.table, everything works fine.

library(data.table)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))

tdt[,list(
            varname='bloodpressure',
            N=.N,
            mean=mean(bloodpressure, na.rm=TRUE),
            sd=sd(bloodpressure, na.rm=TRUE)
            ),
        by=(strata.var)]

I get this result:

   strata.var       varname   N     mean       sd
1:          0 bloodpressure 500 100.2821 15.13686
2:          1 bloodpressure 500 100.0392 15.02566

This matches the group means:

> mean(tdt$bloodpressure[tdt$male==0])
[1] 100.2821
> mean(tdt$bloodpressure[tdt$male==1])
[1] 100.0392

But if I try to do this programmatically, with the column stored in another variable (var):

var_as_string <- 'bloodpressure'
var <- with(tdt, get(var_as_string))

tdt[,list(
            varname='bloodpressure',
            N=.N,
            mean=mean(var, na.rm=TRUE),
            sd=sd(bloodpressure, na.rm=TRUE)
            ),
        by=(strata.var)]

I get a different result.

   strata.var       varname   N     mean       sd
1:          0 bloodpressure 500 100.1606 15.13686
2:          1 bloodpressure 500 100.1606 15.02566

Notice that the mean is now identical for both groups (i.e. it is calculated across the whole sample, not by group):

> mean(tdt$bloodpressure)
[1] 100.1606
drstevok
  • @Arun I don't think I understand the scoping issue ... I have assigned `bloodpressure` to `var`, but this is clearly incorrect/inadequate, and `data.table` is only seeing `var`. I'll do some more reading ... ;) – drstevok Oct 10 '14 at 15:07
  • I'm not sure yet, but I think you may have discovered a bug here.. Testing. Will write back :-). – Arun Oct 10 '14 at 15:18
  • Yes it's a bug, as I suspected. Filed [#875](https://github.com/Rdatatable/data.table/issues/875). Thanks! – Arun Oct 10 '14 at 15:53

2 Answers

You can replace mean=mean(var, na.rm=TRUE) with mean=mean(get(var_as_string), na.rm=TRUE), and then it should work. Otherwise j just uses the full numeric vector stored in var, which is not split by group, rather than the data.table column you want it to use, so it returns mean(var) for both subgroups.

library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean=100, sd=15), male=rep(c(0,1)))
strata.var <- with(tdt, get(c('male')))

tdt[,list(
        varname='bloodpressure',
        N=.N,
        mean=mean(bloodpressure, na.rm=TRUE),
        sd=sd(bloodpressure, na.rm=TRUE)
        ),
    by=(strata.var)]        
#   strata.var       varname   N      mean       sd
#1:          0 bloodpressure 500  99.58425 15.55735
#2:          1 bloodpressure 500 100.06630 15.50188

var_as_string <- 'bloodpressure'

tdt[,list(
        varname='bloodpressure',
        N=.N,
        mean=mean(get(var_as_string), na.rm=TRUE),
        sd=sd(bloodpressure, na.rm=TRUE)
        ),
    by=(strata.var)]                
#   strata.var       varname   N      mean       sd
#1:          0 bloodpressure 500  99.58425 15.55735
#2:          1 bloodpressure 500 100.06630 15.50188
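
To make the scoping point concrete, here is a minimal sketch (not part of the original answer) that compares lengths inside j. The names bp_vec, len_vec and len_col are illustrative only, and the grouping uses the male column directly rather than strata.var.

```r
library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean = 100, sd = 15),
                  male = rep(c(0, 1), times = 500))

var_as_string <- "bloodpressure"
# A plain copy of the column that lives outside the table (hypothetical name)
bp_vec <- tdt[[var_as_string]]

res <- tdt[, list(
  len_vec = length(bp_vec),              # not a column, so never subset by group
  len_col = length(get(var_as_string)),  # column looked up per group, .N values
  N = .N
), by = male]
print(res)
```

Under these assumptions len_vec should be the full row count in every group while len_col matches N, which is why mean() on the outside vector comes out the same for both strata.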
konvas
  • Not that I know any better, but is this preferable to the `.SDcols` approach I came across and posted below? – drstevok Oct 10 '14 at 15:16
  • For one it seems more straightforward, so I would prefer that. I just compared the speed and it seems marginally faster too, but I'd just pick whichever syntax seems easier... – konvas Oct 10 '14 at 15:27
  • Cheers. I have switched to the `.SDcols` method because it seems to generalise better - at least in my hands ;) - and fixed a similar problem I was having with the `by` clause. – drstevok Oct 10 '14 at 15:33

OK. With much help from this excellent post, I think I have an answer ...

colVars <- c('bloodpressure')
byCols <- c('male')
# Mean of every column named in colVars, within each group of byCols
tdt[, lapply(.SD, mean), .SDcols = colVars, by=byCols]
# Several summaries at once; each lapply() call comes back as a list column
tdt[, list(
    mean = lapply(.SD, mean),
    sd = lapply(.SD, sd)
    ), .SDcols = colVars, by=byCols]

The trick is to use .SD and .SDcols, and then to wrap everything in lapply.
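
As a rough sketch of how this might generalise to several variables, the version below returns one row per group and variable. It assumes a second, made-up column (heartrate) purely for illustration, and swaps lapply for sapply so that mean and sd are ordinary numeric columns; treat it as one possible layout rather than the canonical data.table idiom.

```r
library(data.table)
set.seed(1)
tdt <- data.table(bloodpressure = rnorm(1000, mean = 100, sd = 15),
                  heartrate = rnorm(1000, mean = 70, sd = 10),  # hypothetical column
                  male = rep(c(0, 1), times = 500))

colVars <- c("bloodpressure", "heartrate")
byCols <- "male"

# Within each group, .SD holds only the columns named in .SDcols
res <- tdt[, list(
  varname = colVars,                       # one label per summarised column
  N = .N,                                  # group size, recycled across rows
  mean = sapply(.SD, mean, na.rm = TRUE),
  sd = sapply(.SD, sd, na.rm = TRUE)
), by = byCols, .SDcols = colVars]
print(res)
```

Because both colVars and byCols are plain character vectors, the same call can be reused for other columns or strata by changing those two variables.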

Why is it that, despite searching for ages, it is only after spending another block of time crafting a question that I manage to find the answer? A question for https://meta.stackoverflow.com/ ...

drstevok