4

I am trying to do a min/max aggregate on a dynamically chosen column in a data.table. It works perfectly for numeric columns but I cannot get it to work on Date columns unless I create a temporary data.table.

It works when I use the name:

dt <- data.table(Index=1:31, Date = seq(as.Date('2015-01-01'), as.Date('2015-01-31'), by='days'))
dt[, .(minValue = min(Date), maxValue = max(Date))]
# minValue   maxValue
# 1: 2015-01-01 2015-01-31

It does not work when I use with=FALSE:

colName = 'Date'
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)),  : 
# could not find function "."

I can use .SDcols on a numeric column:

colName = 'Index'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
#   minValue maxValue
#  1:        1       31

But I get an error when I do the same thing for a Date column:

colName = 'Date'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) : 
#   only defined on a data frame with all numeric variables

If I use lapply(.SD, min) or sapply() then the dates are changed to numbers.

The following works and does not seem to waste memory and is fast. Is there anything better?

a <- dt[, colName, with=F]
setnames(a, 'a')
a[, .(minValue = min(a), maxValue = max(a))]
Arun
  • 116,683
  • 26
  • 284
  • 387
Jim Cutler
  • 63
  • 4
  • Short answer: Use `.SD[[1]]` because `.SD` is a list of vectors and you want to use the first vector in it. Long answer: you seem to have found some weird loophole where a data.frame of all numeric cols can have its `max` and `min` taken; and I wouldn't build on that weird exception. – Frank Nov 08 '15 at 00:33

1 Answers1

2

On your first attempt:

dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)),  : 
# could not find function "."

You should simply read the Introduction to data.table vignette to understand what with= means. It's easier if you're aware of with() function from base R.

On the second one:

dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) : 
#   only defined on a data frame with all numeric variables

This seems like an issue with min() and max() on a data.frame/data.table with column with attributes. Here's a MRE.

df = data.frame(x=as.Date("2015-01-01"))
min(df)
# Error in FUN(X[[i]], ...) : 
#   only defined on a data frame with all numeric variables

To answer your question, you can use get():

dt[, .(min = min(get(colName)), max = max(get(colName)))]

Or as @Frank suggested, [[ operator to subset the column:

dt[, .(min = min(.SD[[colName]]), max = max(.SD[[colName]]))]

There's not yet a nicer way of applying .SD to multiple functions (because base R doesn't seem to have one AFAICT, and data.table tries to use base R functions as much as possible). There's a FR #1063 to address this issue. If/when that gets implemented, then one could do, for example:

# NOTE: not yet implemented, FR #1063
dt[, colwise(.SD, min, max), .SDcols = colName]
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Thinking about that colwise, ... a clumsy workaround for the OP's case is `dcast(dt, .~., value.var="Date", fun=list(min,max))` or `dcast(dt, .~., value.var=names(dt), fun=list(min,max))` to do both vars at once. – Frank Nov 08 '15 at 12:30