I am trying to work with larger numbers, over 2^32. While I am also using data.table and fread, I do not believe the problem is related to them. I can turn on and off they symptoms without changing data.table or having used fread. My symptoms are that I am getting a reported mean of 4.1e-302 when I am expecting a positive exponent 1e+3 to 1e+17
The problem consistently appears when using the bit64 package and functions related to integer64. Things work for me in "regular sized data and R" but I am not expressing things correctly in this package. See my code and data below.
I am on a MacBook Pro, 16GB, i7 (updated).
I have restarted my R session and cleared the workspace, but the problem consistently remains.
Please kindly advise, I appreciate the input. I assume it has to to with using the library, bit64.
Links I looked at included bit64 doc
An issue that had similar symptoms caused by an fread() memory leak, but I think I eliminated
Here is my input data
var1,var2,var3,var4,var5,var6,expected_row_mean,expected_row_stddev
1000 ,993 ,987 ,1005 ,986 ,1003 ,996 ,8
100000 ,101040 ,97901 ,100318 ,96914 ,97451 ,98937 ,1722
10000000 ,9972997 ,9602778 ,9160554 ,8843583 ,8688500 ,9378069 ,565637
1000000000 ,1013849241 ,973896894 ,990440721 ,1030267777 ,1032689982 ,1006857436 ,23096234
100000000000 ,103171209097 ,103660949260 ,102360301140 ,103662297222 ,106399064194 ,103208970152 ,2078732545
10000000000000 ,9557954451905 ,9241065464713 ,9357562691674 ,9376495364909 ,9014072235909 ,9424525034852 ,334034298683
1000000000000000 ,985333546044881 ,994067361457872 ,1034392968759970 ,1057553099903410 ,1018695335152490 ,1015007051886440 ,27363415718203
100000000000000000 ,98733768902499600 ,103316759127969000 ,108062824583319000 ,111332326225036000 ,108671041505404000 ,105019453390705000 ,5100048567944390
My code, working with this sample data
# file: problem_bit64.R
# OBJECTIVE: Using larger numbers, I want to calculate a row mean and row standard deviation
# ERROR: I don't know what I am doing wrong to get such errors, seems bit64 related
# PRIORITY: BLOCKED (do this in Python instead?)
# reported Sat 9/24/2016 by Greg
# sample data:
# each row is 100 times larger on average, for 8 rows, starting with 1,000
# for the vars within a row, there is 10% uniform random variation. B2 = ROUND(A2+A2*0.1*(RAND()-0.5),0)
# Install development version of data.table --> for fwrite()
install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
require(data.table)
require(bit64)
.Machine$integer.max # 2147483647 Is this an issue ?
.Machine$double.xmax # 1.797693e+308 I assume not
# -------------------------------------------------------------------
# ---- read in and basic info that works
csv_in <- "problem_bit64.csv"
dt <- fread( csv_in )
dim(dt) # 6 8
lapply(dt, class) # "integer64" for all 8
names(dt) # "var1" "var2" "var3" "var4" "var5" "var6" "expected_row_mean" "expected_row_stddev"
dtin <- dt[, 1:6, with=FALSE] # just save the 6 input columns
... now the problems start
# -------------------------------------------------------------------
# ---- CALCULATION PROBLEMS START HERE
# ---- for each row, I want to calculate the mean and standard deviation
a <- apply(dtin, 1, mean.integer64); a # get 8 values like 4.9e-321
b <- apply(dtin, 2, mean.integer64); b # get 6 values like 8.0e-308
# ---- try secondary variations that do not work
c <- apply(dtin, 1, mean); c # get 8 values like 4.9e-321
c <- apply(dtin, 1, mean.integer64); c # same result
c <- apply(dtin, 1, function(x) mean(x)); c # same
c <- apply(dtin, 1, function(x) sum(x)/length(x)); c # same results as mean(x)
##### I don't see any sd.integer64 # FEATURE REQUEST, Z-TRANSFORM IS COMMON
c <- apply(dtin, 1, function(x) sd(x)); c # unrealistic values - see expected
Regular sized R on Regular data, still using the data read in with fread() into a data.table() - WORKS
# -------------------------------------------------------------------
# ---- delete big numbers, and try regular stuff - WHICH WORKS
dtin2 <- dtin[ 1:3, ] # just up to about 10 million (SAME DATA, SAME FREAD, SAME DATA.TABLE)
dtin2[ , var1 := as.integer(var1) ] # I know there are fancier ways to do this
dtin2[ , var2 := as.integer(var2) ] # but I want things to work before getting fancy.
dtin2[ , var3 := as.integer(var3) ]
dtin2[ , var4 := as.integer(var4) ]
dtin2[ , var5 := as.integer(var5) ]
dtin2[ , var6 := as.integer(var6) ]
lapply( dtin2, class ) # validation
c <- apply(dtin2, 1, mean); c # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) mean(x)); c # CORRECT
c <- apply(dtin2, 1, function(x) sum(x)/length(x)); c # same results as mean(x)
c <- apply(dtin2, 1, sd); c # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) sd(x)); c # CORRECT