I tried hard to come up with a simple example demonstrating my issue. Unfortunately, I failed, and I apologize for that.
I am working with data.table 1.9.6 on a very large dataset with more than 100 million rows and nearly 60 columns. After several operations I face the following issue with column del, which I will try to explain using the following outputs:
> data[, class(del)]
[1] "logical"
> data[, summary(del)]
Mode FALSE TRUE NA's
logical 124763883 2088978 0
So far so good, everything is fine. But then I noticed:
> nrow(data[del==TRUE])
[1] 0
> nrow(data[del==FALSE])
[1] 126790922
It seems that some of the entries are neither TRUE nor FALSE but in fact somewhere between 0 and 1:
> nrow(data[del<0.5])
[1] 124763883
> nrow(data[del>0.5])
[1] 2088978
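One way to test the "somewhere in between" guess is to look at the integer values stored behind the logical column. The following is only a minimal sketch on a freshly built table (the name dt and its contents are hypothetical stand-ins, not the real data), so it shows what a healthy column looks like rather than reproducing the problem:

```r
library(data.table)

# Hypothetical stand-in for the real table: 129 FALSE and 3 TRUE values.
dt <- data.table(del = c(rep(FALSE, 129), rep(TRUE, 3)))

# Tabulate the integer values behind the logical column.
# For an intact logical column only 0 and 1 (and possibly NA) should appear.
underlying <- dt[, table(as.integer(del), useNA = "ifany")]
print(underlying)

# Count TRUE and FALSE entries without using a `==` subset in i.
n_true  <- dt[, sum(del)]
n_false <- dt[, sum(!del)]
```

If the same calls on the real table showed values other than 0 and 1, the column contents themselves would be damaged; if they show only 0 and 1, the problem more likely lies in how the subset is computed.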
But how can that happen? I only assigned TRUE and FALSE values to this column. Again, I would have loved to produce a small example, but it does not work, in the sense that the issue disappears if I select a subset. For example, if I select row 47639 (id is unique):
> data[id==47639, list(del)]
del
1: TRUE
del seems to be set to TRUE. So if I first select this specific row and then test whether del is TRUE, it works:
> data[id==47639][del==TRUE, list(del)]
del
1: TRUE
But data[del==TRUE][id==47639, list(del)] returns no rows:
Empty data.table (0 rows) of 1 col: del
[Edit] I have spent a considerable amount of time reducing the size of my dataset to make a reproducible example. It now has only 132 rows and 1 column and still shows the odd behaviour. This is what I get:
> data[, summary(del)]
Mode FALSE TRUE NA's
logical 129 3 0
> nrow(data[del==TRUE])
[1] 1
> nrow(data[del==FALSE])
[1] 127
> nrow(data[del<0.5])
[1] 129
> nrow(data[del>0.5])
[1] 3
Using dput to post the sample here fails because:
> dput(data)
structure(list(del = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table",
"data.frame"), row.names = c(NA, -132L), .internal.selfref = <pointer: 0x1914ca8>, index = structure(integer(0), "`__del`" = integer(0)))
So dput and dget together produce:
> dput(data, "dt")
> dget("dt")
Error in parse(file = file, keep.source = keep.source) :
dt:16:62: unexpected '<'
15: FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table",
16: "data.frame"), row.names = c(NA, -132L), .internal.selfref = <
This looks like a bug to me. Of course, I could use the workaround data[(del)], but that means rewriting all my previous code to make it safe. What concerns me most is that I do not know which operation exactly corrupts data.
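To narrow down which form of the subset disagrees, the counts can be compared side by side. The sketch below rebuilds a clean copy of the 132-row sample from the dput output above (TRUE at positions 112, 114 and 128); dt, flags and count_true are hypothetical names. On an intact table all three counts agree, so this only shows the comparison and does not reproduce the fault. The dput output also lists an index attribute for del, so one unconfirmed guess is that del==TRUE is answered through a different code path than (del).

```r
library(data.table)

# Clean rebuild of the 132-row sample: TRUE at positions 112, 114 and 128.
flags <- rep(FALSE, 132)
flags[c(112, 114, 128)] <- TRUE
dt <- data.table(del = flags)

# Count the TRUE rows in three different ways.
count_true <- function(x) {
  c(
    by_equals = nrow(x[del == TRUE]),  # the form that misbehaves in the post
    by_parens = nrow(x[(del)]),        # the workaround from the post
    by_sum    = sum(x$del)             # base R on the plain column vector
  )
}

counts <- count_true(dt)
print(counts)
```

Running count_true on the affected table after each processing step might help locate the operation after which by_equals starts to differ from the other two.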
I could also provide a tiny RData file if that is of any help but I do not know how to post it here correctly.
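The dget error above comes from the .internal.selfref = <pointer: ...> entry, which is not valid R syntax. Two possible ways around it are sketched below, again on a clean rebuild of the sample (dt, flags and the temporary file names are hypothetical). Note the trade-off: dput on the bare column drops all data.table attributes, which may also drop whatever triggers the behaviour, whereas saveRDS stores the object with its attributes.

```r
library(data.table)

# Clean rebuild of the 132-row sample: TRUE at positions 112, 114 and 128.
flags <- rep(FALSE, 132)
flags[c(112, 114, 128)] <- TRUE
dt <- data.table(del = flags)

# Option 1: dput only the column vector, which parses back without error.
txt_file <- tempfile(fileext = ".R")
dput(dt$del, txt_file)
restored <- data.table(del = dget(txt_file))

# Option 2: serialize the whole object to a binary file that can be shared.
rds_file <- tempfile(fileext = ".rds")
saveRDS(dt, rds_file)
from_rds <- readRDS(rds_file)
```

The resulting .rds file is a single binary file that can be uploaded to any file host and linked from the post.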