
I am working with a dataset that has several columns representing integer ID numbers (e.g. transactionId and accountId). These IDs are often 12 digits long, which makes them too large to store as 32-bit integers.

What's the best approach in a situation like this?

  1. Read the ID in as a character string.
  2. Read the ID in as an integer64 using the bit64 package.
  3. Read the ID as a numeric (i.e. double).

I have been warned about the dangers of testing equality with doubles, but I'm not sure whether that will be a problem when using them as IDs: I might merge and filter based on them, but I would never do arithmetic on the ID numbers.

Character strings intuitively seem like they should be slower for equality tests and merges, but maybe in practice it doesn't make much of a difference.
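As a quick sketch of the double-precision concern: doubles represent every integer up to 2^53 exactly, so a 12-digit ID is well inside the exact range and equality tests on it behave like integer comparisons.

```r
# Doubles store every integer up to 2^53 exactly, so a 12-digit
# ID (< 10^12) is represented without rounding.
id <- 123456789012            # parsed as a double
stopifnot(id == 123456789012) # exact equality, no floating-point surprise
stopifnot(10^12 < 2^53)       # 12-digit IDs fit well inside the exact range

# Beyond 2^53, consecutive integers collide in double arithmetic:
(2^53 + 1) == 2^53            # TRUE
```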

Ben Bolker
Rob Donnelly
    Conceptually those are characters (or even a factor variable) and I would treat them as such. A data.table merge with a character key is very fast. – Roland Feb 03 '16 at 08:44

2 Answers


See Roland's comment on the original question. Your IDs should be character vectors. Since it is very unlikely that IDs will be used for math-like operations, it is generally safer to store them as character vectors. He also points out that data.table merges on a character key are very fast; perhaps not as fast as integer merges, but fast nonetheless. In most cases this should be okay.
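A minimal base-R sketch of merging on character IDs (data.table's keyed joins follow the same pattern; the table and column names here are hypothetical):

```r
# Two small tables keyed on a character transactionId column
txns  <- data.frame(transactionId = c("100000000001", "100000000002"),
                    amount        = c(10.5, 20.0),
                    stringsAsFactors = FALSE)
accts <- data.frame(transactionId = c("100000000002", "100000000001"),
                    accountId     = c("900000000002", "900000000001"),
                    stringsAsFactors = FALSE)

# Equality and merging on character IDs are exact string matches,
# with none of the floating-point caveats that apply to doubles.
merged <- merge(txns, accts, by = "transactionId")
```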

ctbrown
    "Since it is very unlikely that IDs are used for math-like operations, it is generally safe to store the value as a character vectors." Not just as safe, but *safer*, since if you accidentally do something mathematical with the ID, like `lapply(DF, median)`, the mistake is easier to catch. – Frank Oct 11 '16 at 19:13

If it is performance you are after, use bit64. From the package documentation:

With ’integer64’ vectors you can store very large integers at the expense of 64 bits, which is by factor 7 better than ’int64’ from package ’int64’. Due to the smaller memory footprint, the atomic vector architecture and using only S3 instead of S4 classes, most operations are one to three orders of magnitude faster: Example speedups are 4x for serialization, 250x for adding, 900x for coercion and 2000x for object creation. Also ’integer64’ avoids an ongoing (potentially infinite) penalty for garbage collection observed during existence of ’int64’ objects (see code in example section).

See the following PDF: https://cran.r-project.org/web/packages/bit64/bit64.pdf
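A short sketch of the idea, assuming the bit64 package is installed:

```r
library(bit64)

# Parse 12-digit IDs into exact 64-bit integers
ids <- as.integer64(c("123456789012", "123456789013"))

# Comparisons are exact integer comparisons, safe for filtering and merging
ids == as.integer64("123456789012")

# data.table::fread(..., integer64 = "integer64") reads large integer
# columns directly into this type when bit64 is available.
```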

Atomic Star