2

I would like to select the first few rows for each factor in a data.table.

library(data.table)
SOURCE <- data.table(NAME = rep(paste0("NAME", 1:3), each = 5), VALUE = sample(c(TRUE, FALSE), 5 * 3, TRUE))
> SOURCE
     NAME VALUE
 1: NAME1  TRUE
 2: NAME1  TRUE
 3: NAME1  TRUE
 4: NAME1 FALSE
 5: NAME1 FALSE
 6: NAME2  TRUE
 7: NAME2 FALSE
 8: NAME2  TRUE
 9: NAME2  TRUE
10: NAME2  TRUE
11: NAME3  TRUE
12: NAME3 FALSE
13: NAME3 FALSE
14: NAME3  TRUE
15: NAME3  TRUE

For instance, here I'd like to select the first 3 rows for each NAME, so I would end up with rows 1-3, 6-8 and 11-13. Any idea how to do that?

I tried this, but it only returns the first group:

> SOURCE[1:3, VALUE, by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1  TRUE
Frank
ChiseledAbs

2 Answers

4

We can also subset by row index (.I): collect the first three row numbers within each NAME, then use them to index SOURCE.

SOURCE[SOURCE[, .I[1:3], by = NAME]$V1]
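
To see what the one-liner does, it can help to split it into its two steps. The following is only a sketch: the table is rebuilt from the values printed in the question (so no random sampling is involved), and `idx` and `first3` are names introduced here for illustration.

```r
library(data.table)

# Deterministic copy of the table printed in the question.
SOURCE <- data.table(
  NAME  = rep(paste0("NAME", 1:3), each = 5),
  VALUE = c(TRUE, TRUE, TRUE, FALSE, FALSE,
            TRUE, FALSE, TRUE, TRUE, TRUE,
            TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Step 1: .I holds row numbers of the full table. Within each NAME group,
# keep the first three; the unnamed result column is called V1.
idx <- SOURCE[, .I[1:3], by = NAME]$V1
print(idx)

# Step 2: use those row numbers as an ordinary i index.
first3 <- SOURCE[idx]
print(first3)
```

With this data, `idx` contains rows 1-3, 6-8 and 11-13, which are the rows asked for in the question.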
akrun
3

This looks like it should do it. It is basically the same as @hrbrmstr's answer in the comments, but it doesn't use the head function:

library(data.table)
set.seed(1)
SOURCE <- data.table(NAME = rep(paste0("NAME", 1:3), each = 5), VALUE = sample(c(TRUE, FALSE), 5 * 3, TRUE))

SOURCE[,.SD[1:3], by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1 FALSE
4: NAME2 FALSE
5: NAME2 FALSE
6: NAME2 FALSE
7: NAME3  TRUE
8: NAME3  TRUE
9: NAME3 FALSE
Mike H.
  • 1
    For what it's worth, optimization is planned for `.SD[int_vec]` but not for `head(.SD, n)`, looks like https://github.com/Rdatatable/data.table/issues/735 – Frank May 29 '16 at 13:03
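
Speed aside, `.SD[1:3]` and `head(.SD, 3)` also behave differently when a group has fewer than three rows, which does not show up in the example above because every NAME has five rows. A small sketch, using a hypothetical table `DT` (not from the question) in which group "b" has only two rows:

```r
library(data.table)

# Hypothetical table: group "a" has 4 rows, group "b" only 2.
DT <- data.table(NAME = c("a", "a", "a", "a", "b", "b"), VALUE = 1:6)

# Integer indexing past the end of a group pads the result with an NA row.
by_index <- DT[, .SD[1:3], by = NAME]

# head() returns only the rows that exist.
by_head <- DT[, head(.SD, 3), by = NAME]

# Index form that avoids the padding by capping at the group size (.N).
by_safe <- DT[, .SD[seq_len(min(.N, 3L))], by = NAME]

print(by_index)
print(by_head)
print(by_safe)
```

If every group is known to have at least three rows, the three forms select the same rows; otherwise the capped index or `head()` is probably the safer choice.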