
This one’s quite weird. I’m not sure whether I’m missing something or whether it’s a bug in data.table or fread.

I’m trying to “stretch” a data table containing a time series in which one time point is missing. When the table is read back from a file, the X[Y] join fills NAs not only in the missing row but also in rows where data points are present. This happens only when the t column used for keying contains floats rather than integers.

library(data.table)

# This works fine; the missing row at t=0.5
# is filled with NA after the join

dt = data.table(id = as.integer(rep(0, 10)), 
                t = seq(0.1, 1, 0.1), 
                y = 1:10, 
                key = "id,t")
dt = dt[!(t == 0.5)]

dtAux = dt[, 
           .(seq(min(t), max(t), 0.1)),
           by = id]
setkey(dtAux, id, V1)

dt[dtAux]

    id   t  y
 1:  0 0.1  1
 2:  0 0.2  2
 3:  0 0.3  3
 4:  0 0.4  4
 5:  0 0.5 NA
 6:  0 0.6  6
 7:  0 0.7  7
 8:  0 0.8  8
 9:  0 0.9  9
10:  0 1.0 10

# This fails; NAs created in multiple rows

fwrite(dt, "test.csv", row.names = FALSE)
dtFromFile = fread("test.csv")
setkey(dtFromFile, id, t)

dtAux = dtFromFile[, 
                   .(seq(min(t), max(t), 0.1)),
                   by = id]
setkey(dtAux, id, V1)

dtFromFile[dtAux]

    id   t  y
 1:  0 0.1  1
 2:  0 0.2  2
 3:  0 0.3 NA
 4:  0 0.4  4
 5:  0 0.5 NA
 6:  0 0.6  6
 7:  0 0.7 NA
 8:  0 0.8  8
 9:  0 0.9  9
10:  0 1.0 10

Tested on R 3.6.1 with data.table 1.12.4.

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux bullseye/sid

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.4

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1
mattek
  • It's a precision problem. See `dt$t - dtFromFile$t`. You'll see that the values read from the file aren't exactly the same as the originals. – Rohit Oct 04 '19 at 06:18
  • @Rohit I was fooled by `all.equal`, which yielded `TRUE`. Then, of course, when I reduce the `tolerance` parameter to something like `1e-18`, I get the difference. – mattek Oct 04 '19 at 09:13
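
A quick way to reproduce the precision check from the comments (a sketch; dt and dtFromFile are the tables built in the question):

dt$t - dtFromFile$t                               # tiny nonzero differences
all.equal(dt$t, dtFromFile$t)                     # TRUE at the default tolerance (~1.5e-8)
all.equal(dt$t, dtFromFile$t, tolerance = 1e-18)  # now reports a mean relative difference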

2 Answers


From ?setNumericRounding

Computers cannot represent some floating point numbers (such as 0.6) precisely, using base 2. This leads to unexpected behaviour when joining or grouping columns of type 'numeric'; i.e. 'double', see example below. In cases where this is undesirable, data.table allows rounding such data up to approximately 11 s.f. which is plenty of digits for many cases. This is achieved by rounding the last 2 bytes off the significand. Other possible values are 1 byte rounding, or no rounding (full precision, default). It is bytes rather than bits because it is tied in with the radix sort algorithm for sorting numerics which sorts byte by byte. With the default rounding of 0 bytes, at most 8 passes are needed. With rounding of 2 bytes, at most 6 passes are needed (and therefore might be a tad faster). For large numbers (integers > 2^31), we recommend using bit64::integer64, even though the default is to round off 0 bytes (full precision).
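
For illustration, a minimal sketch of the behaviour the help page describes (DT is just an example name):

library(data.table)

setNumericRounding(0)          # the default: full precision
DT = data.table(x = c(0.1, 0.3, 0.6), v = 1:3, key = "x")
DT[.(0.1 + 0.2)]               # no match: 0.1 + 0.2 differs from 0.3 in the last bits

setNumericRounding(2)
DT[.(0.1 + 0.2)]               # matches the stored 0.3 once the last 2 bytes are rounded off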

To fix it, you can call the following before running your code.

setNumericRounding(2)

See ?setNumericRounding for more examples, and also Round to a multiple and filter in data.table.
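
Applied to the question's pipeline, a sketch of the fix (assuming the test.csv written by fwrite above):

setNumericRounding(2)   # round off the last 2 bytes of the significand in joins/sorts

dtFromFile = fread("test.csv")
setkey(dtFromFile, id, t)

dtAux = dtFromFile[, .(seq(min(t), max(t), 0.1)), by = id]
setkey(dtAux, id, V1)

dtFromFile[dtAux]       # only the genuinely missing t = 0.5 row should be NA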


edit: OP is asking why the join works when the data is created in the script but not when it is read from a file. It might be due to ALTREP (see slide 9 here).

library(data.table) #data.table_1.12.2 R Win x64
dt_seq <- data.table(x = seq(0.1, 0.5, 0.2), v=1:3, key = "x")
dt_c <- data.table(x = c(0.1,0.3,0.5), v=1:3, key = "x")

dtAux_seq = data.table(x=seq(0.1, 0.5, 0.1), key="x")
dtAux_c = data.table(x=c(0.1,0.2,0.3,0.4,0.5), key="x")

fwrite(dt_seq, "test.csv")
dtFromFile <- fread("test.csv", key="x")

Testing different joins:

> dt_seq[dtAux_seq]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3  2
4: 0.4 NA
5: 0.5  3

> dt_c[dtAux_seq]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3 NA
4: 0.4 NA
5: 0.5  3

> dtFromFile[dtAux_seq]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3 NA
4: 0.4 NA
5: 0.5  3

Explicitly typing values in dtAux_c:

> dt_seq[dtAux_c]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3 NA
4: 0.4 NA
5: 0.5  3

> dt_c[dtAux_c]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3  2
4: 0.4 NA
5: 0.5  3

> dtFromFile[dtAux_c]
     x  v
1: 0.1  1
2: 0.2 NA
3: 0.3  2
4: 0.4 NA
5: 0.5  3
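
One way to see why the seq-built and c()-built keys disagree at 0.3 is to print more digits than R's default 7 (the exact digits below assume standard IEEE doubles):

sprintf("%.17g", seq(0.1, 0.5, 0.2))  # "0.10000000000000001" "0.30000000000000004" "0.5"
sprintf("%.17g", c(0.1, 0.3, 0.5))    # "0.10000000000000001" "0.29999999999999999" "0.5"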
chinsoon12

This is due to floating-point errors on decimal values. It is not an R-specific issue, but simply the way computers handle decimals.
For more background, read: Why are these numbers not equal?

When joining on decimal values, a solution is to round them first to the relevant number of digits.

dtAux2 = dtFromFile[, 
                    .(round(seq(min(t), max(t), 0.1), digits = 1)),
                    by = id]
setkey(dtAux2, id, V1)

dtFromFile[dtAux2]

#     id   t  y
#  1:  0 0.1  1
#  2:  0 0.2  2
#  3:  0 0.3  3
#  4:  0 0.4  4
#  5:  0 0.5 NA
#  6:  0 0.6  6
#  7:  0 0.7  7
#  8:  0 0.8  8
#  9:  0 0.9  9
# 10:  0 1.0 10
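
Alternatively, since the question notes the problem only occurs with floating-point keys, you could join on integers instead. A sketch (t10 and dtAux3 are illustrative names; t is encoded in tenths):

dtFromFile[, t10 := as.integer(round(t * 10))]   # 0.1 -> 1L, ..., 1.0 -> 10L
setkey(dtFromFile, id, t10)

dtAux3 = dtFromFile[, .(t10 = seq(min(t10), max(t10))), by = id]
setkey(dtAux3, id, t10)

dtFromFile[dtAux3]   # exact integer matching; only t10 = 5 (t = 0.5) is NA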
Wimpel