Is it possible that there is a memory bottleneck in seqdist()?
I'm a researcher working with register data on a Windows x64 machine with 64 GB of RAM. Our data covers 60,000 persons, and at the moment I'm working on a data set with about 2.2 million rows in SPELL format. I can't run seqdist() on it (method="OM", indel=1, sm="TRATE", with.missing=TRUE, full.matrix=FALSE); the error message is the same as in here, where the important part seems to point to insufficient memory: "negative length vectors are not allowed".
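For clarity, this is the shape of the call that fails on the full data. The tiny data frame below is only a toy stand-in for my real 2.2-million-row SPELL data (already converted to wide form for seqdef()); the object names are made up, and the toy of course runs without any memory problem:

library(TraMineR)

# Toy stand-in for the real data: 4 persons, 3 time points, one missing value.
toy <- data.frame(id = 1:4,
                  t1 = c("A", "A", "B", "C"),
                  t2 = c("B", NA,  "B", "C"),
                  t3 = c("B", "C", "A", "C"))
toyseq <- seqdef(toy, 2:4)

# Same arguments as the failing call on the 60,000-person data:
toydist <- seqdist(toyseq,
                   method       = "OM",      # optimal matching
                   indel        = 1,         # insertion/deletion cost
                   sm           = "TRATE",   # substitution costs from transition rates
                   with.missing = TRUE,      # treat missing as an extra state
                   full.matrix  = FALSE)     # return the lower triangle only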
OK, but seqdist() doesn't seem to utilize my whole RAM. Right now I'm running it on a sample of 40,000 persons, and it seems to go through, but R is using less than 2 GB of RAM. If I run seqdist() on all 60,000 persons, I get the error.
Might there be a size limit of 2^31-1 in there somewhere?
Calculating Ward clusters readily utilizes all available RAM: I've had it use up to 40 GB, which at least shows that R is capable of using large amounts of memory.
Edit: The maximum number of cases is exactly 46341; with 46342 the call fails. A warning, though: if you drop the size to 46341 or below, the example below actually runs and eats memory. Example:
library(TraMineR)

# n = 46342 fails; n = 46341 still runs (and eats memory)
n <- 46342
id <- 1:n
set.seed(234324)
time1 <- sample(1:3, size = n, replace = TRUE)
time2 <- sample(1:3, size = n, replace = TRUE)
time3 <- sample(1:3, size = n, replace = TRUE)
testdata <- data.frame(id, time1, time2, time3)
testseq <- seqdef(testdata, 2:4)
# Fails with "negative length vectors are not allowed"
testdist <- seqdist(testseq, method = "OM", indel = 1, sm = "TRATE", full.matrix = FALSE)
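For what it's worth, that exact boundary is consistent with a 32-bit signed integer overflow in the size of the returned distance object: with full.matrix=FALSE the result is the lower triangle, i.e. n*(n-1)/2 values, and the intermediate product n*(n-1) exceeds 2^31-1 precisely when n reaches 46342. I haven't checked the TraMineR source, so this is only a plain-R consistency check, not a statement about where the overflow actually happens:

.Machine$integer.max   # 2147483647, i.e. 2^31 - 1
46341L * 46340L        # 2147441940 -- still fits in a 32-bit integer
46342L * 46341L        # overflows: R returns NA with an integer-overflow warning

In C code the same overflow would wrap around to a negative number, which would fit the "negative length vectors are not allowed" message.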