I tried hard to come up with a simple example demonstrating my issue. Unfortunately, I failed, so please accept my apologies.

I am working with data.table 1.9.6 on a very large dataset with more than 100 million rows and nearly 60 columns. After several operations, I ran into the following issue with the column del, which I will try to explain with the following output:

> data[, class(del)]
[1] "logical"

> data[, summary(del)]
   Mode     FALSE    TRUE   NA's
logical 124763883 2088978      0

So far so good, everything is fine. But then I noticed:

> nrow(data[del==TRUE])
[1] 0

> nrow(data[del==FALSE])
[1] 126790922

It seems that some of the entries are neither TRUE nor FALSE but in fact somewhere between 0 and 1:

> nrow(data[del<0.5])
[1] 124763883

> nrow(data[del>0.5])
[1] 2088978
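As a sanity check on these two comparisons, here is a minimal base R sketch. The vector `del` below is a made-up stand-in, not the real column. Comparing a logical vector with 0.5 coerces TRUE and FALSE to 1 and 0, so on an intact logical column `del < 0.5` and `del > 0.5` should select exactly the FALSE and TRUE entries.

```r
# Made-up stand-in for the real column (assumption: plain logical, no NA)
del <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# In a numeric comparison, TRUE is coerced to 1 and FALSE to 0
lt <- del < 0.5   # TRUE exactly where del is FALSE
gt <- del > 0.5   # TRUE exactly where del is TRUE

# Both comparisons agree with the logical values themselves
lt_matches_false <- identical(lt, !del)
gt_matches_true  <- identical(gt, del)

# The equality tests select the same positions when evaluated on the bare vector
eq_true_positions <- which(del == TRUE)
plain_positions   <- which(del)
```

If the same checks on `data$del` also agree, the stored values are probably fine and the mismatch is more likely to come from how the subset is evaluated inside `data[...]`.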

But how can that happen? I only ever assigned TRUE and FALSE to this column. Again, I would have loved to produce a small example, but it does not work, in the sense that the issue disappears if I select a subset. For example, if I select the row with id 47639 (id is unique):

> data[id==47639, list(del)]
    del
1: TRUE

del seems to be set to TRUE. So if I first select this specific row and then test whether del is TRUE it works:

> data[id==47639][del==TRUE, list(del)]
    del
1: TRUE

But reversing the order, data[del==TRUE][id==47639, list(del)], returns no rows:

Empty data.table (0 rows) of 1 col: del

[Edit] I have spent a considerable amount of time reducing the size of my dataset to make a reproducible example. It now has only 132 rows and 1 column and still shows the odd behaviour. This is what I get:

> data[, summary(del)]
   Mode   FALSE    TRUE    NA's 
logical     129       3       0 
> nrow(data[del==TRUE])
[1] 1
> nrow(data[del==FALSE])
[1] 127
> nrow(data[del<0.5])
[1] 129
> nrow(data[del>0.5])
[1] 3

Using dput to post the sample here fails because the output contains a pointer that cannot be parsed back:

> dput(data)
structure(list(del = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table", 
"data.frame"), row.names = c(NA, -132L), .internal.selfref = <pointer: 0x1914ca8>, index = structure(integer(0), "`__del`" = integer(0)))

So dput and dget together produce:

> dput(data, "dt")
> dget("dt")
Error in parse(file = file, keep.source = keep.source) : 
  dt:16:62: unexpected '<'
15: FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table", 
16: "data.frame"), row.names = c(NA, -132L), .internal.selfref = <
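The parse error points at the `.internal.selfref = <pointer: ...>` entry, which is not valid R syntax. A minimal base R sketch of two ways around it follows. The object `sample_df` and its `note` attribute are made-up stand-ins for the 132-row table, and a plain data.frame is used so the sketch runs without data.table. `saveRDS()` writes a binary file, so nothing has to be parsed as R code. `dput()` on the bare column gives text that `dget()` can read back, but it drops every attribute of the table.

```r
# Made-up stand-in for the small table (assumption: plain data.frame)
sample_df <- data.frame(del = c(FALSE, TRUE, FALSE, TRUE))
attr(sample_df, "note") <- "extra attribute carried along by saveRDS"

# Option 1: binary round trip, keeps the object's attributes
rds_file <- tempfile(fileext = ".rds")
saveRDS(sample_df, rds_file)
restored <- readRDS(rds_file)
same_after_rds <- identical(sample_df, restored)

# Option 2: text round trip of the column only, table attributes are lost
txt_file <- tempfile(fileext = ".R")
dput(sample_df$del, file = txt_file)
del_back <- dget(txt_file)
same_after_dput <- identical(sample_df$del, del_back)
```

Whether the text form still reproduces the problem is uncertain, since it no longer carries the `index` attribute that appears in the dput output above.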

This looks like a bug to me. Of course, I could use the workaround data[(del)], but that means rewriting all my previous code to make it safe. What concerns me most is that I do not know exactly which operation corrupts the data.
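A minimal sketch of the workaround and of a consistency check, assuming data.table is installed. The table `dt` is a made-up stand-in for `data`. The sketch also assumes that `options(datatable.auto.index = FALSE)` and `setattr(dt, "index", NULL)` behave in 1.9.6 as they do in current releases, which I have not verified.

```r
library(data.table)

# Made-up stand-in for the real table
dt <- data.table(id = 1:6, del = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))

# Reference count taken from the bare column, outside of `[.data.table`
n_base <- sum(dt$del)

# Optimised form: may create or reuse an automatic index on `del`
n_indexed <- nrow(dt[del == TRUE])

# Parenthesised forms: evaluated as ordinary logical vectors
n_paren <- nrow(dt[(del == TRUE)])
n_plain <- nrow(dt[(del)])

# Assumption: this option stops new automatic indices from being created
options(datatable.auto.index = FALSE)

# Assumption: dropping the attribute discards any existing, possibly stale index
setattr(dt, "index", NULL)
index_removed <- is.null(attr(dt, "index"))
```

On a healthy table all four counts agree. A difference between `n_indexed` and `n_base` would point at the index rather than at the column values.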

I could also provide a tiny RData file if that is of any help but I do not know how to post it here correctly.

Machavity
nisse
  • You tagged this with CRAN, but are you using the latest data.table version on CRAN (1.9.6)? There was a bug, earlier, that meant you had to write `data[(del==TRUE)]` with the parentheses to get the expected results. – Frank Nov 24 '15 at 16:26
  • @Frank, not just `data[(del)]`? – talat Nov 24 '15 at 16:28
  • @docendodiscimus D'oh, you are right. – Frank Nov 24 '15 at 16:29
  • Thanks to both of you. Yes, I am using the latest version of data.table, ie 1.9.6. I will try the workaround you mentioned. However, do you know where this bug was posted and whether it affects other column types as well? – nisse Nov 24 '15 at 16:46
  • It affected many cases of the form `DT[a == x]` or `DT[a %in% y]`, independent of column type. If you're seeing a problem like this on 1.9.6, maybe a reproducible example is necessary. Guidance on that is over here: http://stackoverflow.com/a/28481250/1191259 For the changelog related to this bug, see "Auto indexing" (item #6 in 1.9.6 Bug Fixes) here: https://github.com/Rdatatable/data.table#bug-fixes-1 – Frank Nov 24 '15 at 17:13
  • Are my additions above of any help in pinning down the source of the problem? Shall I file a bug report now? It seems as if this "auto indexing" issue is still present in data.table 1.9.6. – nisse Nov 26 '15 at 12:41

0 Answers