5

Using data.table, say I'm setting the key using two columns, and one of the columns has missing values. Data table seems to sort the NA values to the first values.

require(data.table)
set.seed(919)

# Create sample data
dt <- data.table(
  key1 = rep(1:10, each = 10),
  key2 = rep_len(letters, 100)
  )

# Set some key2 values to missing
dt[sample(1:100, 10), "key2"] <- NA

# Set key (sort)
setkeyv(dt, c("key1", "key2"))
dt
# 1:    1   NA
# 2:    1    a
# 3:    1    b
# 4:    1    c
# 5:    1    d
# 6:    1    f
# 7:    1    g
# 8:    1    h
# 9:    1    i
# 10:    1    j
# 11:    2   NA
# 12:    2   NA
# 13:    2    k
# 14:    2    m
# 15:    2    n
# 16:    2    o
# 17:    2    p
# 18:    2    q
# 19:    2    r
# 20:    2    s
# 21:    3    a
# 22:    3    b
# 23:    3    c
# 24:    3    d
# 25:    3    u
# 26:    3    v
# 27:    3    w
# 28:    3    x
# 29:    3    y
# 30:    3    z
# 31:    4    e
# 32:    4    f
# 33:    4    g
# 34:    4    h
# 35:    4    i
# 36:    4    j
# 37:    4    k
# 38:    4    l
# 39:    4    m
# 40:    4    n
# 41:    5   NA
# 42:    5   NA
# 43:    5    o
# 44:    5    q
# 45:    5    r
# 46:    5    s
# 47:    5    u
# 48:    5    v
# 49:    5    w
# 50:    5    x
# 51:    6   NA
# 52:    6    a
# 53:    6    b
# 54:    6    c
# 55:    6    d
# 56:    6    e
# 57:    6    g
# 58:    6    h
# 59:    6    y
# 60:    6    z
# 61:    7    i
# 62:    7    j
# 63:    7    k
# 64:    7    l
# 65:    7    m
# 66:    7    n
# 67:    7    o
# 68:    7    p
# 69:    7    q
# 70:    7    r
# 71:    8   NA
# 72:    8   NA
# 73:    8    a
# 74:    8    b
# 75:    8    t
# 76:    8    u
# 77:    8    w
# 78:    8    x
# 79:    8    y
# 80:    8    z
# 81:    9   NA
# 82:    9    c
# 83:    9    d
# 84:    9    e
# 85:    9    f
# 86:    9    h
# 87:    9    i
# 88:    9    j
# 89:    9    k
# 90:    9    l
# 91:   10   NA
# 92:   10    m
# 93:   10    n
# 94:   10    o
# 95:   10    p
# 96:   10    r
# 97:   10    s
# 98:   10    t
# 99:   10    u
# 100:   10    v
# key1 key2

Does this always happen, or will I run into problems if I always assume this is true?

Jake Fisher
  • 3,220
  • 3
  • 26
  • 39
  • 3
    Yes, `data.table` behaves like base R `sort(x,na.last=FALSE)` wherever it uses sorting, and I think the authors are very likely to stick with this convention. If you are just looking to sort your data, you should consider `setorder`. Setting a key has other uses, with sorting just being a side effect. By the way, the standard syntax there is `dt[sample(1:100, 10),key2:=NA]` and you should watch out for mistaking the two-character string `"NA"` for `NA` (not a problem in your example). – Frank May 21 '15 at 21:38
  • 3
    Oh, just came across this possible rationale: "NAs are internally large negative number[s]", which may explain the sorting. https://github.com/Rdatatable/data.table/issues/434 – Frank May 21 '15 at 21:43
  • 1
    @Frank can you put your comments as answer? so we can get rid off unanswered question on the tag :) – jangorecki Jul 18 '15 at 00:12
  • @jangorecki Okay, done. There's probably a better reference for NAs being large negative numbers, but this'll do, I guess. Thanks. – Frank Jul 18 '15 at 00:19

1 Answers1

4

For setkey(), data.table behaves like base R sort(x, na.last=FALSE), as the sort order (always increasing) is essential for binary search based joins/subsets. Rationale for NAs appearing first is that:

"NAs are internally large negative number[s]" github.com/Rdatatable/data.table/issues/434


Miscellaneous comments: If you are just looking to reorder your data, you should consider setorder(), which is capable of sorting in any order and positioning NAs in the beginning or end.

By the way, the standard syntax there is dt[sample(1:100, 10), key2 := NA] and you should watch out for mistaking the two-character string "NA" for NA (not a problem in your example).

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180