
I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and if I then write the same ID to a CSV file, the value of the ID changes.

For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":

Code:

options(scipen=999)

x <- 12345678912345678912

print(x)

Output:

[1] 12345678912345679872

My questions are:

1) Why is this happening?

2) How can I fix this problem?

I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope my question is clear.

I don't know whether this question has already been asked on SO, so please point me to a link if it's a duplicate and I will remove this post.

I have gone through this, so I can relate it to my issue, but I am unable to fix it.

Any help would be highly appreciated. Thanks.

PKumar
  • Why don't you format your variable as character? – Cath Jan 13 '15 at 10:06
  • Thanks for replying. The problem persists: if I use as.character(x), the value of x is again "12345678912345679872". – PKumar Jan 13 '15 at 10:07
  • I meant to format it beforehand: when you import your data, you can specify the character colClasses for the ID variable (so, in effect, doing x<-"12345678912345678912"). Would this work? – Cath Jan 13 '15 at 10:08
  • Otherwise, maybe you can specify a larger number of digits with `options(digits=30)`, for example? – Cath Jan 13 '15 at 10:11
  • The number is too big to be represented as an integer. Thus, it is represented as a double, which leads to [issues with floating point number accuracy](http://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal). There are [possibilities to use big integers](http://stackoverflow.com/questions/2053397/long-bigint-decimal-equivalent-datatype-in-r) in R, but since your numbers are IDs you should follow CathG's advice and treat them as character strings. – Roland Jan 13 '15 at 10:15
  • @CathG Yes, it works. This is how I am doing it now: read.csv("file.csv",colClasses=c("character",rep(NULL,1))), as I have only two columns (ID and value). Thanks. By the way, you can post your suggestion as an answer; I would love to accept it. – PKumar Jan 13 '15 at 10:19
  • OK, great. I'm guessing your second column is numeric? If so, you can instead use colClasses=c("character","numeric") (by the way, there is no need to use `rep` if you're repeating just once ;-) ) – Cath Jan 13 '15 at 10:34
  • Thanks @CathG, you can post your suggestion as an answer. It would be helpful to everyone. – PKumar Jan 13 '15 at 10:37

3 Answers


By default, R does not handle integers larger than 2147483647L (the value of `.Machine$integer.max`), because its integer type is 32-bit.

If you append an L to your number (to tell R it's an integer), you get:

x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value 

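This also explains why the last digits change: R stores the number as a double, which has 53 bits of precision (roughly 15 to 17 significant decimal digits), so a 20-digit value is rounded to the nearest representable double.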
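A small base-R sketch of the rounding described above. The variable names (`limit`, `id_num`, `id_chr`) are illustrative and not from the original post; the point is only that doubles stop distinguishing consecutive integers at 2^53, far below a 20-digit ID.

```r
# Largest value R's 32-bit integer type can hold.
print(.Machine$integer.max)

# Doubles have 53 bits of precision: above 2^53, neighbouring integers collapse.
limit <- 2^53
print(limit + 1 == limit)  # TRUE: limit + 1 is not representable as a double

# The 20-digit ID from the question, as a number and as a string.
id_num <- 12345678912345678912
id_chr <- "12345678912345678912"

# Formatting the double shows the rounded value, not the digits that were typed.
print(sprintf("%.0f", id_num) == id_chr)  # FALSE
print(id_chr)                             # the string keeps every digit
```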

I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store the IDs as character strings, or use a data type from the gmp package.
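The last two options might look like the sketch below. It assumes the gmp package may or may not be installed, so the big-integer part is guarded with `requireNamespace`; the names `id_chr` and `id_big` are illustrative. For IDs, which are only compared and never used in arithmetic, the string form is usually enough.

```r
id_chr <- "12345678912345678912"

# Option 1: keep the ID as a string; equality checks and joins still work.
print(id_chr == "12345678912345678912")
print(nchar(id_chr))

# Option 2 (only if arithmetic on the value is needed): gmp big integers.
# Assumption: gmp is installed; the block is skipped otherwise.
if (requireNamespace("gmp", quietly = TRUE)) {
  id_big <- gmp::as.bigz(id_chr)   # parse from the string, never from a double
  print(as.character(id_big + 1))
}
```

Note that the big integer is built from the character string: converting the already-rounded double would carry the rounding error along.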

Anders Ellern Bilgrau

To circumvent the problem caused by how numbers are stored and represented, you can import your ID variable directly as character with the colClasses argument. For example, if you use read.csv to import a data.frame with the ID column and another numeric column:

mydata <- read.csv("file.csv", colClasses = c("character", "numeric"), ...)
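A self-contained sketch of the same idea, using a temporary file in place of "file.csv". The file contents and the names `tmp` and `out` are made up for illustration; the column layout (ID, then a numeric value) follows the comments on the question.

```r
# Write a tiny two-column CSV to a temporary file (illustrative data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,value",
             "12345678912345678912,1.5",
             "12345678912345678913,2.5"), tmp)

# Read the ID column as character so no digits are lost.
mydata <- read.csv(tmp, colClasses = c("character", "numeric"))
str(mydata)

# Writing the data back out keeps the IDs intact as well.
out <- tempfile(fileext = ".csv")
write.csv(mydata, out, row.names = FALSE)
print(readLines(out))
```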
Cath

Using readr, you can do:

mydata <- readr::read_csv("file.csv", col_types = list(ID = readr::col_character()))

where "ID" is the name of your ID column.
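A runnable sketch of the readr approach, assuming readr is installed (the block is skipped otherwise). The temporary file and its contents are illustrative; `readr::cols()` is used here as an equivalent way to give a per-column specification, with the remaining columns left to readr's type guessing.

```r
# Assumption: readr is installed; skip quietly if it is not.
if (requireNamespace("readr", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  writeLines(c("ID,value", "12345678912345678912,1.5"), tmp)

  # Force the ID column to character; other columns are guessed.
  mydata <- readr::read_csv(
    tmp,
    col_types = readr::cols(ID = readr::col_character())
  )
  print(mydata$ID)
}
```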

MrFlick