Big numeric-type equality and subset in data.table package

Question

I am using the data.table package for data processing. I noticed issues with equality and subsetting when large numbers are involved. For example:

dt <- data.table(a = c(1, 841026176807, 841026176808))
dt[a==841026176807]
          a
1: 841026176807
2: 841026176808

I thought it was loss of precision from numeric type (representation of double/floating point numbers), but this works:

dt[dt$a==841026176807]
          a
1: 841026176807

Why is the behavior inconsistent? Is this documented somewhere, or is it a bug?

`841026176807 > .Machine$integer.max #[1] TRUE`, but despite this I think that the increased precision available in R's somewhat new use of 53bit integers should have kicked in. Sometimes one needs to use character values for data input but this should not be needed for console operations. — IRTFM, Jul 26 '16 at 06:44
This was due to the default rounding of the last 2 bytes of numeric types to avoid floating-point inaccuracies, as documented under `?setNumericRounding`. This behaviour has now been restored to normal (i.e., no rounding anymore) in the [current devel version](https://github.com/Rdatatable/data.table/wiki/Installation). – Arun, Jul 26 '16 at 14:09
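The rounding described in the comment above can be explored with a small sketch. It assumes a data.table version that still provides `setNumericRounding()` and `getNumericRounding()`; the variable names are illustrative only. The value 841026176807 lies between 2^39 and 2^40, so dropping the last 2 bytes (16 bits) of the 52-bit mantissa leaves a resolution of 8 at this magnitude, which would put 841026176807 and 841026176808 in the same bucket. Whether the 2-byte setting still reproduces the two-row result depends on the installed version, so no output is claimed for that call.

```r
library(data.table)

dt <- data.table(a = c(1, 841026176807, 841026176808))

# Remember the current setting so it can be restored afterwards.
old_rounding <- getNumericRounding()

# Round the last 2 bytes of doubles in data.table's ordering and joins.
# On affected versions this made the two large values compare equal.
setNumericRounding(2L)
res_rounded <- dt[a == 841026176807]
print(res_rounded)

# No rounding: doubles are compared at full precision.
setNumericRounding(0L)
res_exact <- dt[a == 841026176807]
print(res_exact)

# Restore whatever was configured before.
setNumericRounding(old_rounding)
```

With rounding set to 0, the subset should agree with the base-R style `dt[dt$a == 841026176807]` from the question.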

score 3 · Answer 1 · answered Jul 26 '16 at 05:57

Current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9.

If you want to store or read values beyond that range as an integer type, you need to store them as 64-bit integers.

The bit64 package can handle this.

require(bit64)
dt <- data.table(a = as.integer64(c(1, 841026176807, 841026176808)))

> dt[a==841026176807]
              a
1: 841026176807
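The limits discussed in this answer and in the comments below can be checked in base R. This is a minimal sketch with illustrative variable names: it shows that the literal is stored as a double, that it exceeds the 32-bit integer range, and that doubles still hold integers of this size exactly (exactness for all integers only ends at 2^53).

```r
# Largest value an R integer (32-bit) vector can hold.
int_max <- .Machine$integer.max

# A numeric literal without the L suffix is a double, not an integer.
x <- 841026176807
x_type <- typeof(x)

# The value is outside the 32-bit integer range ...
exceeds_int <- x > int_max

# ... but a double still distinguishes it from its neighbour.
neighbour_distinct <- (x + 1) != x

# Beyond 2^53, consecutive integers are no longer all representable.
collapses_at_2_53 <- (2^53 + 1) == 2^53

print(c(exceeds_int, neighbour_distinct, collapses_at_2_53))
```

So the double type by itself is not what loses the distinction between 841026176807 and 841026176808; `integer64` becomes necessary for exactness only for integers beyond 2^53.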

answered Jul 26 '16 at 05:57

PPC

I upvoted but the first sentence is false. Please educate yourself about R's capacity and correct the misinformation. .Machine$integer.max no longer represents the upper limit of "integer precision" although it still limits the number of maximal dimension for vectors and matrices. – IRTFM Jul 26 '16 at 06:48
@42 please refer to the help page `?integer` in R. The first line of my answer is taken from there. Yes, I may need to educate myself in R, as I know I am not an expert. – PPC Jul 26 '16 at 09:00
Yes. That's a reasonable reference to start with. It also says : "doubles can hold much larger integers exactly." The help pages are not indexed in a manner that seems very helpful here, although I think this information was published in them at one point. I cited a News item: http://stackoverflow.com/questions/21140818/long-vector-not-supported-yet-error-in-r-windows-64bit-version/21142236#21142236 and updated an earlier answer when version 3.0 made the change: http://stackoverflow.com/questions/8804779/what-is-integer-overflow-in-r-and-how-can-it-happen/8804991#8804991 – IRTFM Jul 26 '16 at 11:15
And: http://stackoverflow.com/questions/29172300/conversion-of-long-values-into-double-in-r – IRTFM Jul 26 '16 at 11:28
In R, `1` is of type "numeric" ("double"), as opposed to `1L` which is integer type. – Arun Jul 26 '16 at 14:10
This seems exactly like my answer, which came first, yet this has 3x the upvotes? :P – Hack-R Jul 26 '16 at 14:28
@Hack-R both answers were posted within a minute of each other. Like-minded people!! – PPC Jul 26 '16 at 16:12

score 1 · Answer 2 · answered Jul 26 '16 at 05:56

The different comparison methods invoke different functions under the hood, and some of them can't handle integers of this size. You can overcome this with `integer64` from bit64, which is the standard practice when dealing with such long integers in R.

require(data.table)
require(bit64)
dt   <- data.table(a = c(1, 841026176807, 841026176808))
dt$a <- as.integer64(dt$a)
dt[a==841026176807]

              a
1: 841026176807

dt[dt$a==841026176807]

              a
1: 841026176807

Regardless of whether you're using data.table or which operations you're carrying out, it's best to either use `integer64` or to recode data containing integers of this length, to avoid any inadvertent errors.

answered Jul 26 '16 at 05:56

Hack-R

I suspect we're looking at a dupe of http://stackoverflow.com/q/34285809/ and that's what you mean by differences under the hood? Anyways, I updated to the latest devel version yesterday and do not see the OP's behavior now so can't really figure it out myself. – Frank Jul 26 '16 at 12:17
@Frank, we've recently removed the rounding of 2-bytes feature (since people don't seem to read the manual and use `integer64` as we suggest). Just have a look at the news. – Arun Jul 26 '16 at 14:12
@Frank I just mean empirically speaking there's something different under the hood or this behavior wouldn't exist as such. I don't profess to know what it is. I just know that using `integer64` is the best practice for this type of data and solves the problem. – Hack-R Jul 26 '16 at 14:29