The following code segfaults my R 2.15.0, running data.table 1.8.9.

library(data.table)
d = data.table(date = c(1,2,3,4,5), value = c(1,2,3,4,5))

# works as expected
d[-5][, mean(value), by = list(I(as.integer((date+1)/2)))]

# crashes R
d[-5, mean(value), by = list(I(as.integer((date+1)/2)))]

And on a related note, the following two commands have very different outputs:

d[-5][, value, by = list(I(as.integer((date+1)/2)))]
#    I value
# 1: 1     1
# 2: 1     2
# 3: 2     3
# 4: 2     4

d[-5, value, by = list(I(as.integer((date+1)/2)))]
#    I         value
# 1: 1 2.121996e-314
# 2: 1 2.470328e-323
# 3: 2 3.920509e-316
# 4: 2 2.470328e-323

A simpler command, taken from the comments, that also crashes my R:

d[-5, value, by = date]

As Ricardo points out, it's the combination of negative indexing and by that creates the problem.

eddi
  • Also crashes my R-3.0.0 (with the same version of **data.table**) on a Windows XP box. – Josh O'Brien Apr 16 '13 at 20:45
  • Doesn't crash for me, but gives the different result. R-3.0.0. `data.table version 1.8.8`. – Arun Apr 16 '13 at 20:48
  • Seems to be a problem with negative indexing. For example, `d[-5, date := 4:1]` gives this warning: ``In `[.data.table`(d, -5, `:=`(date, 4:1)) : Supplied 4 items to be assigned to 1 items of column 'date' (3 unused)`` – Arun Apr 16 '13 at 20:49
  • Maybe @MatthewDowle could weigh in sometime? – Arun Apr 16 '13 at 20:58
  • like @Arun, it did *not* crash for me, but did get different results. I'm on `R 2.15.3, data.table 1.8.8, Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)` – Ricardo Saporta Apr 16 '13 at 21:03
  • Have a look at this: `d[-5, .SD, by=value]; d[-5, .SD]; d[1:4, .SD, by=value]`. It looks like it may not be related at all to the complexity of the `by` argument, but simply to its presence along with a negative index. Also: `d[-3, .SD, by=date]` – Ricardo Saporta Apr 16 '13 at 21:05
  • this caused a crash for me: `d[-3, .SD, .SDcols="value", by=date]` – Ricardo Saporta Apr 16 '13 at 21:08
  • Gives a Segmentation Fault for me - you should always be clear when saying 'it crashes', since sometimes people say that when all they get is an error message. – Spacedman Apr 16 '13 at 22:55
  • @Spacedman - you seem to be the first person to think that, but ok, changed :) – eddi Apr 16 '13 at 23:13
  • @Spacedman was the first to state it, he definitely wasn't the first to think it. – mnel Apr 16 '13 at 23:19

2 Answers

One hypothesis is that the problem is related to the following lines in `[.data.table`:

o__ = if (length(o__)) irows[o__]
      else irows

`o__` eventually gets passed to the C code (dogroups.c), as -5 in this case. One could imagine a negative row number causing issues with pointer arithmetic there, leading to segfaults and/or erroneous values.

A potential workaround would be to use data.table's not-join syntax:

d[!5, mean(value), by = list(I(as.integer((date+1)/2)))]

which passes through some different logic on the way to C:

if (notjoin) {
    ... Omitted for brevity ...
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL
}
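
The same conversion can also be done by hand before subsetting, so that only positive row numbers reach the grouping code. This is a minimal sketch, assuming the `d` from the question; `keep`, `grp` and `res` are illustrative names, not part of data.table:

```r
library(data.table)

d <- data.table(date = c(1, 2, 3, 4, 5), value = c(1, 2, 3, 4, 5))

# Turn the negative index into the positive row numbers to keep,
# mirroring seq_len(nrow(x))[-irows] from the not-join branch.
keep <- seq_len(nrow(d))[-5]

# Group on the kept rows only; the by column is named grp for readability.
res <- d[keep, mean(value), by = list(grp = as.integer((date + 1) / 2))]
print(res)
```

This should give the same result as `d[-5][, ...]` without the intermediate copy, though I have not timed it against the other workarounds.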
user1935457
  • thanks, for the record, `d[!5, ...]` is a little slower in my tests than `d[-5][, ...]` – eddi Apr 17 '13 at 23:30
  • Filed as [#2697](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2697&group_id=240&atid=975). Thanks @eddi and user1935457. – Matt Dowle Apr 19 '13 at 18:01

UPDATE: This has been fixed in v1.8.11. From NEWS:

Crash or incorrect aggregate results with negative indexing in i is fixed, #2697. Thanks to Eduard Antonyan (eddi) for reporting. Tests added.
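
To check whether an installed version behaves correctly, the chained and direct forms from the question can be compared. This is a rough sketch rather than the test that was added to the package; the version guard assumes the fix is present from 1.8.11 onwards, and `chained`, `direct`, `same` and `grp` are illustrative names:

```r
library(data.table)

d <- data.table(date = c(1, 2, 3, 4, 5), value = c(1, 2, 3, 4, 5))

# Do not run the direct form on an affected version: it may segfault.
stopifnot(packageVersion("data.table") >= "1.8.11")

# Subset first, then group (this form worked before the fix too).
chained <- d[-5][, mean(value), by = list(grp = as.integer((date + 1) / 2))]

# Negative index and by in one call (the form that used to crash).
direct <- d[-5, mean(value), by = list(grp = as.integer((date + 1) / 2))]

# Both forms should now return the same table.
same <- isTRUE(all.equal(chained, direct))
print(same)
```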

Matt Dowle
Arun