In the help page for special-symbols in data.table, it says ".N can be used in i as well." How do I do that?

For example, I would expect the following code to keep only rows where there is one element in the group.

> library(data.table)
> set.seed(734)
> dt <- data.table(x = c(rep("a", 5), rep("b", 3), "c", "d", "e"),
                   y = runif(11))
> dt
    x          y
 1: a 0.46431448
 2: a 0.57148294
 3: a 0.30197960
 4: a 0.06394102
 5: a 0.08793526
 6: b 0.62994539
 7: b 0.64693916
 8: b 0.79671939
 9: c 0.60865117
10: d 0.86025196
11: e 0.21562992

> dt[.N == 1, .(y), by = .(x)]
Empty data.table (0 rows) of 2 cols: x,y

I would have expected this to have the same result as:

> dt[, .(n = .N, y = y), by = .(x)][n == 1, .(x, y)]
   x         y
1: c 0.6086512
2: d 0.8602520
3: e 0.2156299

If not like the example above, how would I use .N in i for data.table?

Henrik
Jake Fisher
  • 4
    Please find the general form of `data.table` syntax in the **Details** section of `?data.table` and in the [Introduction vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html), **Basics** section: "Take DT, subset/reorder rows by `i`, _then_ compute `j` grouped by `by`." Thus, _first_ index in `i`, _then_ calculate/select in `j`. E.g. in your `dt[.N == 1, .(y), by = .(x)]`, you _first_ subset rows in `i` using the logical condition `.N == 1`. `.N` is 11, so this evaluates to `FALSE` and zero rows are selected in `i`... – Henrik Jul 03 '19 at 22:05
  • 3
    ... _then_ you try to "do things" in `j` on these zero rows. So use `.N` in `i` as you want, but remember the general form of the `data.table` syntax: "Take DT, subset/reorder rows by `i`, _then_ compute `j` grouped by `by`." – Henrik Jul 03 '19 at 22:06
  • 3
    For your actual example, please see the use of `.N` in [Subset data frame based on number of rows per group](https://stackoverflow.com/a/20204630/1851712). – Henrik Jul 03 '19 at 22:19
  • 1
    Thanks @Henrik. It looks like the answer is that I didn't understand the order of operations. `data.table` does `i`, then `by`, then `j`. As a result, using `.N` in `i` doesn't reflect what's in `by`. – Jake Fisher Jul 05 '19 at 16:56
  • 1
    Correct. That's true for _whatever_ you put in `i`, not only `.N`. Just read on in the (long but) excellent `?data.table` and play around with simple examples. Another relevant quote (from the `by` argument): "The data.table is then grouped by the `by` and `j` is evaluated within each group." - thus, it is **`j`** which is evaluated by group, not `i`. Good luck! Cheers – Henrik Jul 05 '19 at 17:00

1 Answer

In your call, the .N-based logical expression in i is not evaluated by group. Instead, compute the row indices (.I) per group in j, extract them ($V1), and use them to subset the rows:

dt[dt[, .I[.N == 1], by = .(x)]$V1]
#   x         y
#1: c 0.6086512
#2: d 0.8602520
#3: e 0.2156299
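
To make the two steps in that one-liner easier to follow, here is a minimal sketch that splits it apart. The intermediate names `idx` and `res` are only for illustration; the data is the same `dt` as in the question.

```r
library(data.table)

set.seed(734)
dt <- data.table(x = c(rep("a", 5), rep("b", 3), "c", "d", "e"),
                 y = runif(11))

# Step 1: within each group, keep the row numbers (.I, positions in dt)
# only when the group has exactly one row. Groups with more rows
# contribute integer(0) and therefore drop out of the result.
idx <- dt[, .I[.N == 1], by = .(x)]
idx$V1
# 9 10 11 (the rows holding "c", "d" and "e")

# Step 2: use those row numbers in the outer i to subset dt.
res <- dt[idx$V1]
```

Because `.I` holds positions in the full table, the inner call can run by group while the outer call subsets without any grouping.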

Alternatively, the same expression can subset .SD within each group (this can be slow when there are many groups):

dt[, .SD[.N == 1], .(x)]
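
A commonly used variant, shown here as a sketch rather than as part of the original answer, returns `.SD` only when the group-size condition holds. When the `if` has no `else` branch, `j` evaluates to `NULL` for the other groups, and those groups are dropped from the result. The names `res_if` and `res_sd` are illustrative.

```r
library(data.table)

set.seed(734)
dt <- data.table(x = c(rep("a", 5), rep("b", 3), "c", "d", "e"),
                 y = runif(11))

# Return the group's rows only if the group has exactly one row;
# otherwise j is NULL and the group does not appear in the output.
res_if <- dt[, if (.N == 1L) .SD, by = .(x)]

# The .SD[...] form from above, for comparison.
res_sd <- dt[, .SD[.N == 1], by = .(x)]
```

Both forms keep the groups "c", "d" and "e"; which one is faster depends on the data and the data.table version, so it is worth timing them on your own table.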

Regarding the note in ?.N:

.SD, .BY, .N, .I and .GRP are read only symbols for use in j. .N can be used in i as well.

However, it does not say in what context. With only an i expression, it works:

dt[.N > 2] # works

It also works with both i and j:

dt[.N > 2, .(x)]
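
In these calls `.N` is the total number of rows in `dt` (11 here), not a per-group count. The following sketch, using the same `dt` as in the question, shows what that means in practice; the variable names are illustrative.

```r
library(data.table)

set.seed(734)
dt <- data.table(x = c(rep("a", 5), rep("b", 3), "c", "d", "e"),
                 y = runif(11))

# In i, .N is nrow(dt), so it is handy for picking rows from the end.
last_row <- dt[.N]            # row 11, where x is "e"
last_two <- dt[(.N - 1):.N]   # rows 10 and 11

# A comparison on .N yields a single TRUE or FALSE for the whole table:
all_rows <- dt[.N > 2]        # 11 > 2 is TRUE, so every row is kept
no_rows  <- dt[.N == 1]       # 11 == 1 is FALSE, so no rows are kept
```

This is why `dt[.N == 1, .(y), by = .(x)]` is empty: i removes every row before any grouping takes place.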

To see what data.table does internally in each case, use verbose = TRUE:

dt[.N == 1, .SD, by = .(x), verbose = TRUE]
#i clause present and columns used in by detected, only these subset: x 
#lapply optimization changed j from '.SD' to 'list(y)'
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ... 
#  memcpy contiguous groups took 0.000s for 1 groups
#  eval(j) took 0.000s for 1 calls
#0.046s elapsed (0.268s cpu) 
#Empty data.table (0 rows and 2 cols): x,y

dt[dt[, .I[.N == 1], by = .(x), verbose = TRUE]$V1]
#Detected that j uses these columns: <none> 
#Finding groups using forderv ... 0.032s elapsed (0.033s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.033s elapsed (0.194s cpu) 
#lapply optimization is on, j unchanged as '.I[.N == 1]'
#GForce is on, left j unchanged
#Old mean optimization is on, left j unchanged.
#Making each group and running j (GForce FALSE) ... dogroups: growing from 0 to 2 rows
#dogroups: growing from 2 to 4 rows
#Wrote less rows (3) than allocated (4).

#  memcpy contiguous groups took 0.000s for 5 groups
#  eval(j) took 0.000s for 5 calls
#0.046s elapsed (0.273s cpu) 
akrun