
Is it possible that there is a memory bottleneck in seqdist()?

I'm a researcher working with register data on a Windows x64 computer with 64 GB of RAM. Our data consists of 60,000 persons, and at the moment I'm working on a dataset of about 2.2 million rows in SPELL format. I can't run seqdist() on it (method="OM", indel=1, sm="TRATE", with.missing=TRUE, full.matrix=FALSE); the error message is the same as in here, and the important part seems to point to running out of memory: "negative length vectors are not allowed".

OK, but seqdist() doesn't seem to use my whole RAM. Right now I'm running it on a sample of 40,000 persons and it seems to go through, yet R is using less than 2 GB of RAM. If I run seqdist() on all 60,000 persons, I get the error.

Might there be a size limit of 2^31-1 in there somewhere?
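For reference, a quick back-of-the-envelope check of how the relevant counts compare with that limit (my own arithmetic; I don't know which quantity seqdist() actually indexes internally):

int.max <- 2^31 - 1                     # 2147483647, same as .Machine$integer.max
npairs <- function(n) n * (n - 1) / 2   # entries in a condensed (lower-triangle) dist

npairs(40000)                  # ~8.0e8, fits comfortably below int.max
npairs(60000)                  # ~1.8e9, still below int.max
60000^2 > int.max              # TRUE: a full n x n index would already overflow
npairs(40000) / npairs(60000)  # ~0.44, the 40,000 sample is less than half the work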

Calculating Ward clusters readily uses all available RAM. I've had it use up to 40 GB, which at least proves that R is capable of using large amounts of RAM.

Edit: The maximum number of cases is exactly 46341. A warning, though: the example below eats a lot of memory if the size is 46341 or less (i.e. when it actually runs). Example:

library(TraMineR)

# 46342 cases: one more than the apparent limit of 46341
id <- seq(from=1, to=46342, by=1)
set.seed(234324)
time1 <- sample(seq(from=1, to=3, by=1), size=46342, replace=TRUE)
time2 <- sample(seq(from=1, to=3, by=1), size=46342, replace=TRUE)
time3 <- sample(seq(from=1, to=3, by=1), size=46342, replace=TRUE)

testdata <- data.frame(id, time1, time2, time3)

testseq <- seqdef(testdata, 2:4)

# fails with "negative length vectors are not allowed" at 46342 cases;
# runs (and eats memory) at 46341 or fewer
testdist <- seqdist(testseq, method="OM", indel=1, sm="TRATE", full.matrix=FALSE)
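The rough numbers behind that boundary, for reference (again my own arithmetic; which quantity overflows inside seqdist() is a guess on my part):

n <- 46341
ndist <- n * (n - 1) / 2     # ~1.07e9 pairwise distances in the condensed result
ndist * 8 / 2^30             # ~8 GiB if each distance is stored as a double
46341 * 46340 <= 2^31 - 1    # TRUE:  n*(n-1) still fits in a 32-bit integer
46342 * 46341 <= 2^31 - 1    # FALSE: one more case and it overflows

So running the example at 46341 cases should need on the order of 8 GB just for the distance object, which would explain the "eats memory" warning above.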
  • Incidentally, if we compare the amount of work by calculating (seq1/seq2)^2, a sample of 40,000 out of 60,000 is slightly less than half the size of the full problem. If this sample indeed needs 2 GB of memory for calculating distances and there is a 4 GB limit, then 60,000 is too much data. – pasipasi Jan 28 '16 at 12:44
  • Yes, R vectors were limited to 2^31-1 elements, and code in contributed packages often uses the older APIs that assume vectors are no longer than that; the error occurs when the package builds a vector that is too long and does not check for it. I'm not familiar with the specific package you're using, so I don't have a solution. Basically, though, it sounds like you'd like an algorithm that does not require n^2 / 2 distances, regardless of the limitations of R's vector representation. – Martin Morgan Jan 28 '16 at 19:01
  • There's a limit on the number of cases due to this, which is exactly n=46341. Full matrix or not, I can't go above 46341 ≈ sqrt(2^31-1). – pasipasi Jan 29 '16 at 08:24
  • And by the way, I don't know why my initial run doesn't eat memory the way my artificial example does (you need to change 46342 to 46341 or less for it to run). See the example above. – pasipasi Jan 29 '16 at 12:04
  • Did you use `WeightedCluster` to aggregate similar sequences? This often reduces the size of your dataset considerably (a sketch of the approach is below these comments). – non-numeric_argument Feb 10 '17 at 08:34
  • Yes I did, and because I'm using fairly complex data there are not many identical sequences, so aggregating didn't help. – pasipasi Feb 15 '17 at 10:46
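A minimal sketch of the aggregation approach suggested in the comments, assuming the wcAggregateCases()/seqdef(weights=) pattern from the WeightedCluster documentation and reusing the toy data from the edit above (with real register data the reduction may be small, as noted in the last comment):

library(TraMineR)
library(WeightedCluster)

## Collapse identical cases first, so seqdist() only sees unique sequences.
ac <- wcAggregateCases(testdata[, 2:4])
uniqueseq <- seqdef(testdata[ac$aggIndex, 2:4], weights = ac$aggWeights)
uniquedist <- seqdist(uniqueseq, method = "OM", indel = 1, sm = "TRATE",
                      full.matrix = FALSE)

The resulting weighted distances can then go into clustering routines that accept weights, e.g. wcKMedoids() from the same package.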

0 Answers