5

Is it possible to store the order of rows in a data.table while preserving its keys?

Lets say I have the following dummy table:

library(data.table)
dt <- data.table(id=letters[1:6], 
                   group=sample(c("red", "blue"), replace=TRUE), 
                   value.1=rnorm(6), 
                   value.2=runif(6))
setkey(dt, id)
dt
   id group    value.1    value.2
1:  a  blue  1.4557851 0.73249612
2:  b   red -0.6443284 0.49924102
3:  c  blue -1.5531374 0.72977197
4:  d   red -1.5977095 0.08033604
5:  e  blue  1.8050975 0.43553048
6:  f   red -0.4816474 0.23658045

I would like to store this table so that rows are ordered by group, and by value.1 in decreasing order, i.e:

> dt[order(group, value.1, decreasing=T),]
   id group    value.1    value.2
1:  f   red -0.4816474 0.23658045
2:  b   red -0.6443284 0.49924102
3:  d   red -1.5977095 0.08033604
4:  e  blue  1.8050975 0.43553048
5:  a  blue  1.4557851 0.73249612
6:  c  blue -1.5531374 0.72977197

Obviously I can save this as a new variable, but I also want to keep the id column as my primary key.

Arun's answer to "What is the purpose of setting a key in data.table?" suggests that this can be achieved with clever use setkey, since it orders the data.table in the order of its keys (although there is no option to set the key to decreasing order):

> setkey(dt, group, value.1, id)
> dt
   id group    value.1    value.2
1:  c  blue -1.5531374 0.72977197
2:  a  blue  1.4557851 0.73249612
3:  e  blue  1.8050975 0.43553048
4:  d   red -1.5977095 0.08033604
5:  b   red -0.6443284 0.49924102
6:  f   red -0.4816474 0.23658045

However, I lose the ability to use id as my primary key, because group is the first key provided:

> dt["a"]
   group id value.1 value.2
1:     a NA      NA      NA
smci
  • 32,567
  • 20
  • 113
  • 146
Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64
  • Not an exact duplicate of http://stackoverflow.com/questions/15597685/subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search/15597713, however I think this will only be possible when [FR#1007](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1007&group_id=240&atid=978) is implemented (secondary keys) – mnel May 16 '14 at 01:41
  • Maybe I'm not understanding it right, but this example doesn't really need sorting in decreasing order, isn't it? All values in `id` are unique... – Arun May 16 '14 at 01:50
  • @mnel I did try to follow that chain of questions, but it wasn't clear to me whether secondary keys would have the same issue. – Scott Ritchie May 16 '14 at 01:54
  • 1
    @Arun I want to sort by `value.1` within each group, but want to keep `id` as the primary key. Does that make sense? – Scott Ritchie May 16 '14 at 01:58
  • @ScottRitchie, yes. Thanks. Could you elaborate a bit on the type of problem for which you're requiring this (subset based on different keys)? – Arun May 16 '14 at 02:13
  • 1
    It's mostly aesthetic. I'm looking at some group of genes, and I'm ranking them within each group by network connectivity, so it's useful for me to be able to quickly see the `data.table` in that order when talking to my supervisor and collaborators. – Scott Ritchie May 16 '14 at 02:30
  • Essentially, I want the `data.table` to behave exactly like `dt; setkey(dt, id)`, but for `show`/`print` to behave as if I'd called `dt[order(group, value.1, decreasing=TRUE)]`. My solution at the moment is to save two tables: one which is keyed, and one which is stored in the order I want. – Scott Ritchie May 16 '14 at 02:32
  • 1
    It is not obvious from your question should the `id` be the primary key. If not, you can still search the table by `id` even `id` is not a key: `dt[id == "a"]` – djhurio May 20 '14 at 12:02
  • A workaround to get decreasing order is to negate the key field: `dt[, Col := -Col]` – smci Apr 14 '15 at 15:03
  • @mnel: FR#1007 on R-forge is no longer accessible to us, but since dt project has moved to github I assume it was implemented, can you confirm? (and provide updated link) – smci Apr 23 '18 at 21:36

3 Answers3

3

Sounds like you simply want to modify print.data.table:

print.data.table = function(x, ...) {
  # put whatever condition identifies your tables here
  if ("group" %in% names(x) && "value.1" %in% names(x)) {
    data.table:::print.data.table(x[order(group, value.1, decreasing = T)], ...)
  } else {
    data.table:::print.data.table(x, ...)
  }
}

set.seed(2)
dt = data.table(id=letters[1:6], 
               group=sample(c("red", "blue"), replace=TRUE), 
               value.1=rnorm(6), 
               value.2=runif(6))
setkey(dt, id)
dt
#   id group     value.1    value.2
#1:  a   red  0.18484918 0.40528218
#2:  e   red  0.13242028 0.44480923
#3:  c   red -1.13037567 0.97639849
#4:  b  blue  1.58784533 0.85354845
#5:  f  blue  0.70795473 0.07497942
#6:  d  blue -0.08025176 0.22582546

dt["c"]
#   id group   value.1   value.2
#1:  c   red -1.130376 0.9763985
eddi
  • 49,088
  • 6
  • 104
  • 155
0

I think you can still search by id only, as follows:

dt[J(unique(group),unique(value.1),"a"), nomatch=0]
   group   value.1 id   value.2
1:  blue 0.4928595  a 0.3311728

from what I gathered unique(column_name) is the way to include all values for that column.

I am not sure if this helps.

Timothée HENRY
  • 14,294
  • 21
  • 96
  • 136
0

Building on @eddi's answer, I've created a hackish solution where I store an unevaluated call to order as an attribute of the data.table, which print.data.table obeys:

set_order <- function(dt, cols, decreasing=FALSE) {
  # Store a call to order as an additional attribute
  attr(dt, "order") <- paste0("order(", paste(cols, collapse=", "), 
                              ", decreasing=", decreasing, ")")
  invisible(dt)
}

print.data.table = function(x, ...) {
  if (!is.null(attr(x, "order"))) {
    # Use the stored ordering to print the data.table
    data.table:::print.data.table(x[eval(parse(text=attr(x, "order")))], ...)
  } else {
    data.table:::print.data.table(x, ...)
  }
}

Giving me the behaviour I want:

dt <- set_order(dt, c("group", "value.1"), decreasing=T)
dt
#    id group    value.1    value.2
# 1:  f   red -0.4816474 0.23658045
# 2:  b   red -0.6443284 0.49924102
# 3:  d   red -1.5977095 0.08033604
# 4:  e  blue  1.8050975 0.43553048
# 5:  a  blue  1.4557851 0.73249612
# 6:  c  blue -1.5531374 0.72977197

tables()
#      NAME NROW MB COLS                     KEY
# [1,] dt      6 1  id,group,value.1,value.2 id 
# Total: 1MB
Scott Ritchie
  • 10,293
  • 3
  • 28
  • 64
  • I noticed data.tables have a `"sorted"` attributed, however changing this doesn't affect the order the table is printed in. – Scott Ritchie May 21 '14 at 02:54