data.table select first row of each group limited to n-1 columns?

Question

The data.table help for the ".SD" function shows how to select the first row of each group:

DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT

DT[, .N, by=x]                         # number of rows in each group

This mostly works well for me, but it breaks when I use all columns to define the group and I don't see why, so I wonder if it's a bug. Example:

# Selecting by n-1 columns works:
DT[, .SD[1], by=c("x", "y", "v", "a")]
   x y v a b
1: b 1 1 1 9
2: b 3 1 2 8
3: b 6 1 3 7
4: a 1 2 4 6
5: a 3 2 5 5
6: a 6 1 6 4
7: c 1 1 7 3
8: c 3 2 8 2
9: c 6 2 9 1

# The result of selecting by all columns is not what I expected:
DT[, .SD[1], by=c("x", "y", "v", "a", "b")]
Empty data.table (0 rows) of 5 cols: x,y,v,a,b

I was actually trying to deduplicate by row, so ended up here: https://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table But I'm still wondering if this is a bug. — zkurtz, Aug 25 '17 at 21:08
`.SD` are all the columns of the data.table excluding the group by columns. If you group by all columns, there is no `.SD` any more. — maccruiskeen, Aug 25 '17 at 21:10
One option is to create a function to take care of these cases `fsd <- function(dt, grps, n) { if(length(grps) == ncol(dt)) { as.data.table(dt)[, tmp := seq_len(.N)][, .(tmp = head(tmp, n)),by = c(grps)][, tmp := NULL][] } else as.data.table(dt)[, head(.SD, n), by = c(grps)] }; fsd(DT, names(DT), 2)` — akrun, Aug 25 '17 at 21:33

score 3 · Accepted Answer · answered Aug 25 '17 at 22:23

As commented by @christoph, .SD doesn't include group columns (which I believe is for efficiency purpose so as not to store duplicated group values), you can verify it by doing this:

unique(DT[, .(name = names(.SD)), by=c('x','v')]$name)
# [1] "y" "a" "b"

unique(DT[, .(name = names(.SD)), by=c('x','v','a')]$name)
# [1] "y" "b"

So if you group by all columns, .SD has nothing in it; And for your specific case, you can just use unique and pass the group variables to the by parameter, which will drop duplicates based on the by columns:

unique(DT, by=c('x','v'))

#   x v y a b
#1: b 1 1 1 9
#2: a 2 1 4 6
#3: a 1 6 6 4
#4: c 1 1 7 3
#5: c 2 3 8 2

unique(DT, by=c('x','v','y','a','b'))

#   x v y a b
#1: b 1 1 1 9
#2: b 1 3 2 8
#3: b 1 6 3 7
#4: a 2 1 4 6
#5: a 2 3 5 5
#6: a 1 6 6 4
#7: c 1 1 7 3
#8: c 2 3 8 2
#9: c 2 6 9 1

Or just `unique(DT)` as this using all columns in `by` is the default behavior. — David Arenburg, Aug 26 '17 at 22:49

data.table select first row of each group limited to n-1 columns?

1 Answers1

Related