2

I have a data table:

> COUNT_ID_CATEGORY
                id 706 799 1703 1726 2119 2202 3203 3504 3509 4401 4517 5122 5558 5616 5619 5824 6202 7205 9115 9909
     1:      86246   9   0   15    4   28    0   15   63   39    5    7   25   27   43   12   64    1   16    0   96
     2:      86252   3   0   17    6   21    0    6   62   24    6    7   12   25   32    6   49    1   26    0  103
     3:   12262064   3   0    1    1   12    0    0    2    1    0    0    0    2    4    0    4    0    0    0   12
     4:   12277270   2   0    0    0    1    0    3    0    3    0    0    0    0   24    0    6    2    5    0   60
     5:   12332190   2   0    2    0    4    0    1    2    0    0    0    1    0    3    0    1    3    2    0   46
    ---                                                                                                             
310661: 4837642552   0   0    0    0    0    0    1    0    0    0    0    0    0    0    0    1    0    0    1    0
310662: 4843417324   0   0    0    0    0    0    0    0    0    0    0    0    0    0    1    2    0    0    0    0
310663: 4847628950   2   0    1    1   16    0    0    2    3    0    0    2    9    5    0    3    3    2    3   14
310664: 4847787712   0   0    0    0    1    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0
310665: 4853598737   0   0    0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    0    0    0
> class(COUNT_ID_CATEGORY)
[1] "data.table" "data.frame"
>

and I wish to read the data as quickly as possible as follows:

COUNT_ID_CATEGORY for (id == 86246) & (category == 706)

which should return the value 9 (top left in the table). (for example)

I can get the row with:

COUNT_ID_CATEGORY[id==86246,]

but how do I get the column?

> dput(head(COUNT_ID_CATEGORY))
structure(list(id = c(86246, 86252, 12262064, 12277270, 12332190, 
12524696), `706` = c(9L, 3L, 3L, 2L, 2L, 0L), `799` = c(0L, 0L, 
0L, 0L, 0L, 0L), `1703` = c(15L, 17L, 1L, 0L, 2L, 0L), `1726` = c(4L, 
6L, 1L, 0L, 0L, 0L), `2119` = c(28L, 21L, 12L, 1L, 4L, 0L), `2202` = c(0L, 
0L, 0L, 0L, 0L, 0L), `3203` = c(15L, 6L, 0L, 3L, 1L, 0L), `3504` = c(63L, 
62L, 2L, 0L, 2L, 11L), `3509` = c(39L, 24L, 1L, 3L, 0L, 3L), 
    `4401` = c(5L, 6L, 0L, 0L, 0L, 1L), `4517` = c(7L, 7L, 0L, 
    0L, 0L, 1L), `5122` = c(25L, 12L, 0L, 0L, 1L, 0L), `5558` = c(27L, 
    25L, 2L, 0L, 0L, 1L), `5616` = c(43L, 32L, 4L, 24L, 3L, 18L
    ), `5619` = c(12L, 6L, 0L, 0L, 0L, 0L), `5824` = c(64L, 49L, 
    4L, 6L, 1L, 10L), `6202` = c(1L, 1L, 0L, 2L, 3L, 6L), `7205` = c(16L, 
    26L, 0L, 5L, 2L, 4L), `9115` = c(0L, 0L, 0L, 0L, 0L, 0L), 
    `9909` = c(96L, 103L, 12L, 60L, 46L, 1L)), .Names = c("id", 
"706", "799", "1703", "1726", "2119", "2202", "3203", "3504", 
"3509", "4401", "4517", "5122", "5558", "5616", "5619", "5824", 
"6202", "7205", "9115", "9909"), sorted = "id", class = c("data.table", 
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x043a24a0>)
Timothée HENRY
  • 14,294
  • 21
  • 96
  • 136

1 Answers1

3

First setkey for fast lookup using data.table's binary search/subset feature:

setkey(COUNT_ID_CATEGORY, id)

Then you can do:

COUNT_ID_CATEGORY[J(86246)][, '706']

The first part COUNT_ID_CATEGORY[J(86246)] performs fast subset using binary search. You can read more about J(.) and what it does here.

The next part [, '706', with=FALSE] takes the subset result, which is a data.table and selects just the column 706.

Just to be complete, this post shows more ways of selecting/subsetting columns from a data.table.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Gary Weissman
  • 3,557
  • 1
  • 18
  • 23
  • @Arun thanks for the edits, great explanation. On my machine there is no speed difference between `dt[J(i)]` and `dt[i]`, but on a large sample my laptop runs `dt[i,j,with=F]` faster than `dt[i][,j,with=F]` by a factor of about 1.5, although I can see for learning purposes why the latter is more clear. – Gary Weissman May 05 '14 at 17:29
  • 2
    GaryWeissman, `DT[J(86246)]` and `DT[86246]` are very different things. The first subsets the row where key column matches 86246. The second subsets the 86246th row. – Arun May 05 '14 at 17:56
  • 1
    For your second question, data.table v <=1.9.2 uses implicit `by-without-by` which means `DT[J(.), j, ..]` executes `j` for each match. That's why I joined first and then did a column subset. In 1.9.3+, the default behaviour will be a normal join - so it'll be fast. If one requires a by-without-by, then we'll have to explicitly state it as `by=.EACHI`. Check the NEWS of 1.9.3 for more info (and the datatable mailing list for discussion). – Arun May 05 '14 at 17:59