The size limit for the distance matrix follows from the maximum allowed index value, which is machine dependent.
When the number n of sequences is huge, one solution is to select a random representative subset of the sequences, compute the dissimilarities for this subset, and cluster the subset.
If cluster membership is needed for each individual sequence, you can identify the medoid of each cluster obtained from the subset and then assign each individual sequence to the closest medoid. For k clusters, this requires computing only n x k distances instead of the n(n-1)/2 distances of the full pairwise matrix.
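To give a feel for the savings, here is a rough back-of-the-envelope sketch. The values of n, m, and k are hypothetical, and the estimate assumes 8 bytes per stored distance (an R double) with no overhead.

```r
# Rough memory estimate, assuming 8 bytes per stored distance (R double).
n <- 100000   # hypothetical total number of sequences
m <- 400      # hypothetical size of the random subset
k <- 4        # number of clusters

full_pairs   <- n * (n - 1) / 2   # entries of the full pairwise 'dist' object
subset_pairs <- m * (m - 1) / 2   # entries for the random subset only
medoid_dists <- n * k             # distances from every sequence to the k medoids

# convert a number of stored distances to GiB
gib <- function(entries) entries * 8 / 2^30

round(c(full    = gib(full_pairs),
        subset  = gib(subset_pairs),
        medoids = gib(medoid_dists)), 3)
```

With these hypothetical values the full matrix would need roughly 37 GiB, while the subset matrix and the n x k distances to the medoids together stay in the order of a few megabytes.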
I illustrate below using the biofam data that ships with TraMineR.
Note that up to version 2.2-0.1, TraMineR tested for the size of the pairwise distance matrix even when refseq was used. This has been fixed in version 2.2-1.
library(TraMineR)
data(biofam)
b.seq <- seqdef(biofam[, 10:25])
## compute pairwise distances on a random subset
set.seed(1) # arbitrary seed, only to make the random subset reproducible
spl <- sample(nrow(b.seq), 400)
bs.seq <- b.seq[spl,]
d.lcs <- seqdist(bs.seq, method="LCS", full.matrix=FALSE)
## cluster the random subset
bs.hclust <- hclust(as.dist(d.lcs), method="ward.D")
#plot(bs.hclust, labels=FALSE)
cl <- cutree(bs.hclust,k=4)
## plot clusters for random subset
seqdplot(bs.seq, group=cl, border=NA)
## Medoids of the clusters
c.cl <- disscenter(d.lcs, group=cl, medoids.index="first")
seqiplot(bs.seq[c.cl,]) # plot of the medoids
## distances to each medoid
## c.cl indexes rows of the subset; spl[c.cl] maps them back to rows of b.seq
dc <- matrix(0, nrow=nrow(b.seq), ncol=length(c.cl))
for (i in seq_along(c.cl)) {
  dc[,i] <- seqdist(b.seq, method="LCS", refseq=spl[c.cl[i]])
}
## cluster membership for the full sequence dataset
## is for each row the column with the smallest distance
cl.all <- max.col(-dc)
## now we can plot clusters for the whole dataset
seqdplot(b.seq, group=cl.all, border=NA)
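
The same pattern can be tried out without TraMineR. The sketch below uses base R only, with toy numeric points and Euclidean distances standing in for sequences and LCS distances; the data, the subset size, and the helper names (x, med) are hypothetical. It also cross-tabulates, for the subset members, the hierarchical clustering against the medoid-based assignment: the two need not agree perfectly, because a point may lie closer to the medoid of another cluster than to the medoid of its own.

```r
set.seed(1)                                   # reproducible toy example
x <- rbind(matrix(rnorm(200, 0), ncol = 2),   # toy data: two groups of points
           matrix(rnorm(200, 5), ncol = 2))

spl <- sample(nrow(x), 40)                    # random subset of the rows
d   <- as.matrix(dist(x[spl, ]))              # pairwise distances within the subset
cl  <- cutree(hclust(as.dist(d), method = "ward.D"), k = 2)

# medoid of each cluster: the member with the smallest sum of distances
# to the other members of its cluster (index within the subset)
med <- sapply(sort(unique(cl)), function(g) {
  idx <- which(cl == g)
  idx[which.min(rowSums(d[idx, idx, drop = FALSE]))]
})

# n x k matrix of distances from every point to each medoid
dc <- sapply(spl[med], function(j) sqrt(rowSums(sweep(x, 2, x[j, ])^2)))

# cluster membership for all points: column with the smallest distance
cl.all <- max.col(-dc, ties.method = "first")

# agreement between subset clustering and medoid-based assignment
table(subset = cl, assigned = cl.all[spl])
```

If the off-diagonal counts of the table are large, the medoids are probably poor representatives of the clusters, and a larger subset or a different number of clusters may be worth trying.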