Today I finally decided to start climbing R's steep learning curve. After a few hours I managed to import my dataset and do a few other basic things, but I am having trouble with data types: a column that contains decimals is imported as integer, and converting it to double changes the values.
While trying to produce a small CSV file to post here as an example, I discovered that the problem only happens when the data file is large enough. My original file is a 1048418 by 12 matrix, but even with "only" 5000 rows I have the same problem, whereas with 100, 1000 or even 2000 rows the column is imported correctly as double.
Here is a smaller dataset (still 500 KB, because with a small dataset the problem does not reproduce). The code is:
> ex <- read.csv("exampleshort.csv",header=TRUE)
> typeof(ex$RET)
[1] "integer"
Why is the column of returns imported as integer when the file is large, even though it clearly holds doubles?
The worst thing is that if I try to convert it to double, the values change:
> exdouble <- as.double(ex$RET)
> typeof(exdouble)
[1] "double"
> ex$RET[1:5]
[1] 0.005587 -0.005556 -0.005587 0.005618 -0.001862
2077 Levels: -0.000413 -0.000532 -0.001082 -0.001199 -0.0012 -0.001285 -0.001337 -0.001351 -0.001357 -0.001481 -0.001486 -0.001488 ... 0.309524
> exdouble[1:5]
[1] 1305 321 322 1307 41
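The "Levels" line in the output above suggests that the column is stored as a factor rather than as plain integers. Here is a minimal sketch that seems to reproduce the same behaviour without my file. The toy values and the stray "C" entry are assumptions made up for illustration; I have not confirmed that my real file contains such an entry.

```r
# Hypothetical toy data, not taken from exampleshort.csv.
# The "C" entry is an assumed stray non-numeric value.
raw <- c("0.005587", "-0.005556", "C", "0.005618")
f <- factor(raw)

typeof(f)              # "integer": a factor stores integer level codes
levels(f)              # the original strings, one per distinct value

# Direct conversion returns the level codes, not the decimals.
codes <- as.double(f)

# Going through character first recovers the numbers;
# the non-numeric "C" becomes NA (the warning is suppressed here).
vals <- suppressWarnings(as.double(as.character(f)))
```

If this is what is happening, the numbers printed by exdouble[1:5] would be positions in the list of levels, not returns.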
This is not the only column that is imported incorrectly, but I figured that if I find a solution for one column, I should be able to sort out the others. Here is some more information:
> sapply(ex,class)
   PERMNO      DATE    COMNAM     SICCD       PRC       RET      RETX    SHROUT    VWRETD    VWRETX    EWRETD    EWRETX
"integer" "integer"  "factor" "integer"  "factor"  "factor"  "factor" "integer" "numeric" "numeric" "numeric" "numeric"
They should be, in this order: integer, date, string, integer, double, double, double, integer, double, double, double, double (the type names are probably wrong, but hopefully you get what I mean).
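To make the desired types concrete, here is a rough sketch using the colClasses argument of read.csv on a tiny inline sample. The sample rows, the company name and the reduced set of five columns are made up for illustration. DATE is read as a character string here, since I am not sure which date format the file uses.

```r
# Hypothetical three-row sample; the values and the column subset are invented.
csv_text <- "PERMNO,DATE,COMNAM,RET,VWRETD
10001,19860131,ACME CORP,0.005587,0.001
10001,19860228,ACME CORP,-0.005556,0.002
10001,19860331,ACME CORP,0.005618,0.003"

# One class per column, in file order.
ex_small <- read.csv(
  text = csv_text,
  header = TRUE,
  colClasses = c("integer", "character", "character", "numeric", "numeric")
)

# Named vector of the class of each column.
col_types <- sapply(ex_small, class)
```

With clean data like this, RET comes back as numeric. If a column declared numeric contains an entry that cannot be parsed as a number, the call stops with an error instead of silently producing a factor, which would at least point at the offending value.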