1

I have a large dataset with above 2 million sequences, including about 180,000 unique ones. I am using the seqdist command to measure distances, and I'll ultimately also try to identify clusters of sequences. Below is the error message I get:

Code and error message

Is there any way of setting a different maximum number of sequences, or some other workaround? Thank you very much in advance!

  • A workaround is to cluster a representative sample of your sequences. See https://stackoverflow.com/a/15929938/1586731 – Gilbert Jul 14 '20 at 06:32
  • Thank you! This helps with clustering, but I do need to compute measures about each individual sequence (even rare ones). For each sequence, I need to measure (1) how distant it is from specific other sequences in the dataset, and (2) whether it belongs to the same cluster as these sequences. Would a solution then be to compute distances for each pair of sequences individually instead of the whole distance matrix? This would solve (1), and I could solve (2) by first identifying clusters through a representative sample and then identifying which cluster is closest to each remaining sequence. – Julien Clément Jul 14 '20 at 18:56

1 Answers1

1

The size limits for the distance matrix follows from the maximum allowed index value. This value is machine dependent.

For huge number n of data, a solution is to select a random representative subset of the sequences, compute the dissimilarities for this subset, and cluster the subset.

If a cluster membership is needed for each individual sequence, you can identify the medoid of each of the clusters obtained from the subset and then assign each individual sequence to the closest medoid. For k clusters, this requires to compute n x k distances instead of the full pairwise matrix.

I illustrate below using the biofam data that ships with TraMineR.

Note that up to version 2.2-0.1, TraMineR tested for the size of the pairwise distance matrix even when refseq was used. This has been fixed in version 2.2-1.

library(TraMineR)
data(biofam)
b.seq <- seqdef(biofam[, 10:25])

## compute pairwise distances on a random subset
spl <- sample(nrow(b.seq),400)
bs.seq <- b.seq[spl,]
d.lcs <- seqdist(bs.seq, method="LCS", full.matrix=FALSE)

## cluster the random subset
bs.hclust <- hclust(as.dist(d.lcs), method="ward.D")
#plot(bs.hclust, labels=FALSE)
cl <- cutree(bs.hclust,k=4)

## plot clusters for random subset
seqdplot(bs.seq, group=cl, border=NA)

## Medoids of the clusters
c.cl <- disscenter(d.lcs, group=cl, medoids="first")
seqiplot(bs.seq[c.cl,]) # plot of the medoids

## distances to each medoids
dc <- matrix(0,nrow=nrow(b.seq),ncol=length(c.cl))
for (i in 1:length(c.cl)) {
  dc[,i] <- seqdist(b.seq,method="LCS",refseq=spl[c.cl[i]])
}

## cluster membership for the full sequence dataset
##  is for each row the column with the smallest distance
cl.all <- max.col(-dc) 

## now we can plot clusters for the whole dataset
seqdplot(b.seq, group=cl.all, border=NA)
Gilbert
  • 3,570
  • 18
  • 28
  • Is this method valid? I'm asking because the cluster classification obtained from `cutree` is not the same as the one obtained from the distance to each medoids. `is_same_cluster = cl.all[spl] == cl`; `table(is_same_cluster)` – Fabio Vaz May 22 '21 at 14:27
  • This is just an illustration of how we could proceed. Of course minimizing distance to medoids is not equivalent to ward hierarchical cluster. – Gilbert May 23 '21 at 08:47