I have the following code:

> dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
> dt
    a  b  c  d
 1: 3  1 11 21
 2: 3  2 12 22
 3: 3  3 13 23
 4: 3  4 14 24
 5: 3  5 15 25
 6: 4  6 16 26
 7: 4  7 17 27
 8: 4  8 18 28
 9: 4  9 19 29
10: 4 10 20 30
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
Starting dogroups ... done dogroups in 0 secs
   a  b  c   d
1: 3 15 65 115
2: 4 40 90 140
> dt[,c(count=.N,lapply(.SD,sum)),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
done dogroups in 0 secs
   a count  b  c   d
1: 3     5 15 65 115
2: 4     5 40 90 140

How do I avoid the scary "very inefficient" warning?

I can add a count column before the aggregation:

> dt$count <- 1
> dt
    a  b  c  d count
 1: 3  1 11 21     1
 2: 3  2 12 22     1
 3: 3  3 13 23     1
 4: 3  4 14 24     1
 5: 3  5 15 25     1
 6: 4  6 16 26     1
 7: 4  7 17 27     1
 8: 4  8 18 28     1
 9: 4  9 19 29     1
10: 4 10 20 30     1
> dt[,lapply(.SD,sum),by="a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
Starting dogroups ... done dogroups in 0 secs
   a  b  c   d count
1: 3 15 65 115     5
2: 4 40 90 140     5

but this does not look too elegant...

GSee
sds
  • You want to "suppress" the warning or do things efficiently? – Arun Apr 21 '13 at 15:27
  • I never said "suppress". I said "avoid" which means I want to do the right thing and make my code behave properly, efficiently, so that there is no need for the warning. – sds Apr 21 '13 at 16:23
  • Obviously I was not quite sure whether you want to "avoid" "seeing" the warning or "avoid" "having" that warning. – Arun Apr 21 '13 at 16:37
  • Are you using the latest version of `data.table` (the latest being 1.8.8)? I am not getting the warnings you are getting. – djhurio Apr 21 '13 at 16:55
  • @djhurio: yes, this is with 1.8.8. – sds Apr 21 '13 at 16:59
  • @djhurio, do this: `options(datatable.verbose = TRUE)` and then try the code. – Arun Apr 21 '13 at 17:04
  • @sds, are you not satisfied with the alternatives (in the edit) or do you have trouble with them as well? In any case, I'd posted a question on the comment you posted about "RHS gets duplicated" [**here**](http://stackoverflow.com/questions/16152161/understanding-optimisation-messages-on-assignment-by-reference-in-a-data-table) to which Matthew answered. – Arun Apr 23 '13 at 06:57
  • @Arun: thanks for your answer and for the question you asked on my behalf – sds Apr 23 '13 at 14:02
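
As the comments above point out, the messages quoted in the question only appear when data.table's verbose mode is on. A minimal sketch of switching it on for a single call and restoring the previous setting afterwards (the variable names `old` and `res` are illustrative):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Turn verbose mode on; options() returns the previous value so it can be restored.
old <- options(datatable.verbose = TRUE)

# With verbose mode on, data.table prints its grouping and optimization messages.
res <- dt[, lapply(.SD, sum), by = "a"]

# Restore whatever setting was in effect before.
options(old)
```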

2 Answers

One way I could think of is to assign count by reference:

dt.out <- dt[, lapply(.SD,sum), by = a]
dt.out[, count := dt[, .N, by=a][, N]]
# alternatively: count := table(dt$a)

#    a  b  c   d count
# 1: 3 15 65 115     5
# 2: 4 40 90 140     5

Edit 1: I still think it's just a message and not a warning. But if you still want to avoid it, just do:

dt.out[, count := as.numeric(dt[, .N, by=a][, N])]

Edit 2: Very interesting. Doing the equivalent of multiple := assignment does not produce the same message.

dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# Detected that j uses these columns: a 
# Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
# Detected that j uses these columns: <none> 
# Optimization is on but j left unchanged as '.N'
# Starting dogroups ... done dogroups in 0 secs
# Detected that j uses these columns: N 
# Assigning to all 2 rows
# Direct plonk of unnamed RHS, no copy.
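
The assignments above attach the counts by position, which works because both grouped results list the groups of `a` in the same order. A hedged sketch of a variant that matches on `a` explicitly instead, using `merge()` (the names `sums` and `counts` are illustrative and not from the answer):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

sums   <- dt[, lapply(.SD, sum), by = a]    # per-group sums of b, c and d
counts <- dt[, list(count = .N), by = a]    # per-group row counts

# merge() on data.tables pairs rows by the value of `a`, not by position.
dt.out <- merge(counts, sums, by = "a")
```

This does two grouping passes plus a merge, so it trades some speed for not depending on row order.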
Arun
  • this generates a warning "RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS." – sds Apr 21 '13 at 16:41
  • How do you say it's a warning? It doesn't say anything about inefficiency... It's just a message. In any case, I've made an edit to not get that message. – Arun Apr 21 '13 at 17:14
  • I think you may find `dt[, .N, by=a][['N']]` more efficient as it won't need to call the overhead of `[.data.table` when simply subsetting. – mnel Apr 21 '13 at 23:48
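
A short sketch of the answer's assignment with the `[["N"]]` extraction suggested in the last comment; whether it is measurably faster will depend on the data, so treat that as something to benchmark rather than a given:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

dt.out <- dt[, lapply(.SD, sum), by = a]

# [["N"]] pulls the count column out as a plain vector,
# without a second call to `[.data.table`.
dt.out[, count := dt[, .N, by = a][["N"]]]
```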

This solution removes the message about the named elements. But you have to put the names back afterwards.

require(data.table)
options(datatable.verbose = TRUE)

dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")

dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

Output

> dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]
Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
Optimization is on but j left unchanged as 'c(.N, unname(lapply(.SD, sum)))'
Starting dogroups ... done dogroups in 0.001 secs
   a V1 V2 V3  V4
1: 3  5 15 65 115
2: 4  5 40 90 140
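
Putting the names back can be done by reference with `setnames()`. A minimal sketch, assuming the count column should be called `N` (that name is a choice made here, not something the output above dictates):

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

dt.out <- dt[, c(.N, unname(lapply(.SD, sum))), by = "a"]

# Result columns are: the grouping column, then the count,
# then the .SD columns in their original order.
setnames(dt.out, c("a", "N", setdiff(names(dt), "a")))
```

Note that the comments below report this `unname()` approach as noticeably slower on large tables.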
djhurio
  • Nice (and better) alternative. Having `.N` at the end makes it easier to set names later using `setnames(dt.out, c(names(dt), "N"))` (a bit easier). – Arun Apr 21 '13 at 17:45
  • This is *significantly* slower: `Starting dogroups ... done dogroups in 0.277 secs` vs `Starting dogroups ... done dogroups in 2.929 secs` – sds Apr 21 '13 at 17:53
  • I am comparing yours (slow) with either mine or @arun's (both fast) – sds Apr 21 '13 at 19:50
  • @djhurio, Trying on a big `data.table` (1e7 by 4 or more columns), I observe the same effect as sds. – Arun Apr 21 '13 at 20:55