How to select the first n rows for each factor in a data.table?

Question

I would like to select the first few rows for each factor in a data.table.

library(data.table)
SOURCE <- data.table(NAME=rep(paste0("NAME", 1:3), each=5), VALUE=sample(c(TRUE,FALSE), 5*3, TRUE))
> SOURCE
     NAME VALUE
 1: NAME1  TRUE
 2: NAME1  TRUE
 3: NAME1  TRUE
 4: NAME1 FALSE
 5: NAME1 FALSE
 6: NAME2  TRUE
 7: NAME2 FALSE
 8: NAME2  TRUE
 9: NAME2  TRUE
10: NAME2  TRUE
11: NAME3  TRUE
12: NAME3 FALSE
13: NAME3 FALSE
14: NAME3  TRUE
15: NAME3  TRUE

For instance, here I'd like to select the first 3 rows for each NAME, so I would end up with rows 1-3, 6-8 and 11-13. Any idea how to do that?

I tried this, but it doesn't work:

> SOURCE[1:3, VALUE, by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1  TRUE

`SOURCE[, head(.SD, 3), by=NAME]` ? (also: `set.seed()` is your friend for reproducibility) — hrbrmstr, May 29 '16 at 02:54

score 4 · Answer 1 · answered May 29 '16 at 03:14

We can also use the row index (`.I`) to subset: collect the first three row numbers within each group, then index the table with them.

SOURCE[SOURCE[, .I[1:3], by = NAME]$V1]
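One caveat worth sketching: when a group has fewer than three rows, `.I[1:3]` pads the index with `NA`, which then yields all-`NA` rows in the result. Below is a minimal sketch, assuming a small made-up table `DT` and a variable `n` (neither is from the question), that uses `head(.I, n)` so each group contributes at most `n` indices.

```r
library(data.table)

# Hypothetical toy data: group "B" has only one row
DT <- data.table(NAME = c("A", "A", "A", "A", "B"), VALUE = 1:5)
n <- 3

# .I[1:n] indexes past the end of short groups, producing NA indices
idx_na <- DT[, .I[1:n], by = NAME]$V1

# head(.I, n) returns at most n row numbers per group, so no NA padding
idx_ok <- DT[, head(.I, n), by = NAME]$V1

# Subset the original table with the collected row numbers
res <- DT[idx_ok]
```

If every group is known to have at least `n` rows, as in the question's data, both forms should select the same rows.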

akrun

score 3 · Accepted Answer · answered May 29 '16 at 03:14

This should do it. It is essentially the same as @hrbrmstr's suggestion in the comments, but it indexes `.SD` directly instead of calling the `head` function:

library(data.table)
set.seed(1)
SOURCE <- data.table(NAME=rep(paste0("NAME", 1:3), each=5), VALUE=sample(c(TRUE,FALSE), 5*3, TRUE))

SOURCE[,.SD[1:3], by=NAME]
    NAME VALUE
1: NAME1  TRUE
2: NAME1  TRUE
3: NAME1 FALSE
4: NAME2 FALSE
5: NAME2 FALSE
6: NAME2 FALSE
7: NAME3  TRUE
8: NAME3  TRUE
9: NAME3 FALSE
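To make the comparison with the `head` version concrete, here is a small sketch that runs both forms on the same data and checks that they agree. The variable names `by_index`, `by_head` and `by_safe` are illustrative only, and the last form, which caps the index at the group size `.N`, is an assumption about how one might guard against groups shorter than three rows.

```r
library(data.table)

set.seed(1)
SOURCE <- data.table(NAME = rep(paste0("NAME", 1:3), each = 5),
                     VALUE = sample(c(TRUE, FALSE), 5 * 3, TRUE))

# Index .SD (the per-group subset) directly
by_index <- SOURCE[, .SD[1:3], by = NAME]

# Same selection using head()
by_head <- SOURCE[, head(.SD, 3), by = NAME]

# Variant that never indexes past the end of a short group
by_safe <- SOURCE[, .SD[seq_len(min(.N, 3L))], by = NAME]

# Every group here has 5 rows, so all three should match
all.equal(by_index, by_head)
all.equal(by_index, by_safe)
```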

Mike H.

For what it's worth, optimization is planned for `.SD[int_vec]` but not for `head(.SD, n)`, looks like https://github.com/Rdatatable/data.table/issues/735 – Frank May 29 '16 at 13:03