I am currently working with n-grams stored in a data.table in numeric form: each word in the vocabulary is assigned a unique 5-digit number, and a single 4-gram looks like this:

10000100001017060484

The reason for storing n-grams this way is that numeric objects take up much less space in R. As a result, I am working with some large numbers, which I occasionally need to convert to character and back for string manipulation. Today I noticed that RStudio does not seem to store large numbers correctly. For example:

as.numeric(125124313242345145234513234432)
[1] 125124313242345143744028208602

As you can see, the number returned agrees with the number entered only in its first 17 digits; the rest is different. The only global option I set was:

options(scipen=999)

Can someone explain why this is happening and how I can fix it?

Regards, Kamran.

Kamran
  • Take a look at https://stackoverflow.com/questions/9508518/why-are-these-numbers – Dason Dec 23 '17 at 20:19
  • Does that also apply to integers that are not calculated, @Dason? I understood the difference between numbers to be "somewhere after the comma/dot". Kamran seems to get wrong numbers for the code in the answer below as well, which does not perform any calculations. But maybe I will simply have to dig deeper into R Inferno, etc. to understand this. – Manuel Bickel Dec 23 '17 at 20:48
  • UPDATE: I realized that this problem does not affect small numeric entries and integers. Subsequently, I found a package called bit64, which allows integers to be stored in R as 64-bit, as opposed to the default 32-bit. This does allow large integers to be created, but not large enough for my use. The search continues... – Kamran Dec 23 '17 at 21:05
  • UPDATE: I have managed to replicate the same issue on another laptop, with a freshly installed R and RStudio. My laptop runs Windows 10, the other laptop ran Windows 7. I am starting to believe that this is not a problem with settings, but something to do with the numeric object class itself. I will contact RStudio support to see if I can get this clarified. – Kamran Dec 24 '17 at 07:40

2 Answers

If you run .Machine$integer.max, it returns 2147483647, which means that by default R cannot handle an integer greater than 2147483647. If you run .Machine$double.xmax, you get 1.797693e+308, which is the largest double-precision floating-point number R can represent. A double is stored in two parts, a significand (1.797...) and an exponent (308). The significand has only 53 bits, roughly 15 to 17 significant decimal digits, so a 30-digit number is rounded to the nearest representable double even though it is far below .Machine$double.xmax.

?.Machine

http://sites.stat.psu.edu/~drh20/R/html/base/html/zMachine.html
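
The limits described above can be checked directly in base R. The following is a minimal sketch; the variable names (limit, ngram_chr, ngram_num, stored_chr) are illustrative only, and the 20-digit value is the 4-gram from the question.

```r
# Largest value a (32-bit) R integer can hold.
print(.Machine$integer.max)

# Doubles have a 53-bit significand, so whole numbers are exact only up to 2^53.
print(.Machine$double.digits)
limit <- 2^53
print(limit + 1 == limit)        # TRUE: 2^53 + 1 is rounded back to 2^53

# The 20-digit 4-gram from the question lies beyond that limit.
ngram_chr <- "10000100001017060484"
ngram_num <- as.numeric(ngram_chr)
print(ngram_num > limit)         # TRUE

# Writing the stored double out in full gives a different digit string.
stored_chr <- sprintf("%.0f", ngram_num)
print(stored_chr == ngram_chr)   # FALSE
```

The exact digits in stored_chr can vary by platform, but they cannot match the original string, because that value is not representable as a double.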

In your case, if you append L (the way of telling R that you want a value stored as an integer) to the number, you get this:

as.numeric(125124313242345145234513234432L)
[1] 1.251243e+29
Warning message:
non-integer value 125124313242345145234513234432L qualified with L; using numeric value 

So the outcome you are seeing is a result of these limits on how R stores integers and doubles.

To overcome this, you can convert the number to a bigz using the gmp package:

library(gmp)
as.bigz("125124313242345145234513234432")

Output:

Big Integer ('bigz') :
[1] 125124313242345145234513234432

This is my understanding of how R stores numbers; it may not be perfect, but this is how I see it.

You may want to look at the gmp documentation: https://cran.r-project.org/web/packages/gmp/gmp.pdf
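
For the character-and-back conversion mentioned in the question, a round trip through bigz might look like the sketch below. It assumes the gmp package is installed (the code is skipped otherwise), and the variable names are illustrative only.

```r
# Assumes the gmp package is installed; the block is skipped otherwise.
if (requireNamespace("gmp", quietly = TRUE)) {
  ngram_chr <- "125124313242345145234513234432"
  big <- gmp::as.bigz(ngram_chr)     # parse the string, not a numeric literal
  back <- as.character(big)          # back to character for string manipulation
  print(identical(back, ngram_chr))  # TRUE: no digits were lost
  print(as.character(big + 1))       # arithmetic on bigz stays exact
}
```

Passing the value as a string matters here: a numeric literal would already have been rounded to a double before as.bigz sees it.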

PKumar
  • Hi Kumar, this is a good suggestion. I have tried it and indeed, it does display the integer correctly. However, I am using numbers to represent words in a sequence to save RAM and a 25 digit Big Integer object takes up 280 bytes, while numeric class only uses 48. Thank you for the help though. – Kamran Dec 24 '17 at 07:37

Sorry for posting this as an answer, but it is too long for a comment. What happens if you run the code below? On my machine, with scipen = 999, your conversion works fine. Are your n-gram numbers really stored as numeric? The code below shows that an error can arise when converting between character and numeric, depending on the settings.

mynumber <- 125124313242345145234513234432
options(scipen = 999)
mynumber == as.numeric(mynumber)
#[1] TRUE
mynumber == as.numeric(as.character(mynumber))
#[1] TRUE

options(scipen = 0)
mynumber == as.numeric(mynumber)
#[1] TRUE
mynumber == as.numeric(as.character(mynumber))
#[1] FALSE 
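
The FALSE in the last comparison is consistent with the documented behaviour of as.character(), which represents doubles to 15 significant digits, so the string no longer identifies the same double. A minimal sketch under that assumption (the names x, s and round_trip_ok are illustrative only):

```r
old <- options(scipen = 0)            # default: scientific notation allowed
x <- 125124313242345145234513234432   # stored as the nearest double
s <- as.character(x)                  # at most 15 significant digits survive
print(s)
round_trip_ok <- x == as.numeric(s)   # compare the double with its round trip
print(round_trip_ok)                  # FALSE, as in the comparison above
options(old)                          # restore the previous settings
```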
Manuel Bickel
  • Hi Manuel, thank you for the reply. I get the same results as you and I can assure you that ngrams were numeric. Incidentally, my Rstudio stores "mynumber" as 125124313242345143744028208602. – Kamran Dec 23 '17 at 20:16
  • That is really strange. I will have to dig deeper into comparing `options`. Maybe we could compare the R default `options` to yours and check where they differ. Maybe some other option corrupts your command. I am not an expert on these settings, so I will need some time to answer. In any case, it will be worth checking all options concerning the formatting of numbers. – Manuel Bickel Dec 23 '17 at 20:41