If for n nucleic acid sequences a table of word frequencies (the sequence ATG corresponds to two words of length 2, AT and TG) is constructed, then that table can be used (directly or after dimensionality-reduction by PCA) to calculate a distance matrix of these sequences, which can then be clustered into a phylogenetic tree (doi:10.1007/s00285-002-0185-3):
library(seqinr)
Bat1 <- read.fasta(file="bat1.FASTA")
Bat1.seq <- Bat1[[1]]
Bat1.count <- as.vector(count(Bat1.seq, 2)) # count word (k-mer) frequencies; rule of thumb: k < log4(sequence length)
...
Counts <- rbind(Bat1.count, ...)
rownames(Counts) <- c("Bat1", ...)
colnames(Counts) <- names(count(Bat1.seq, 2)) # count() returns a named 1-d table, so use names()
RowCounts <- rowSums(Counts)
Counts.norm <- Counts/RowCounts # normalise word counts for different sequence length
distance <- dist(Counts.norm, method = "euclidean")
hc <- hclust(distance, method = "average")
plot(hc)
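For the PCA variant mentioned above, one can project the normalised count matrix onto the first q principal components before computing distances. A minimal sketch (the Counts.norm matrix here is simulated stand-in data, and q = 3 is an arbitrary illustrative choice; in the real pipeline Counts.norm comes from the word-frequency code above):

```r
# Simulated stand-in for the normalised word-count matrix (5 sequences, 16 dinucleotides)
set.seed(1)
Counts.norm <- matrix(runif(5 * 16), nrow = 5,
                      dimnames = list(paste0("Seq", 1:5), NULL))
Counts.norm <- Counts.norm / rowSums(Counts.norm)

pca <- prcomp(Counts.norm, center = TRUE)   # PCA on the row-normalised counts
q <- 3                                      # number of components to keep (illustrative)
Scores <- pca$x[, 1:q, drop = FALSE]        # per-sequence scores in PC space

distance.pca <- dist(Scores, method = "euclidean")
hc.pca <- hclust(distance.pca, method = "average")
plot(hc.pca)
```

Choosing q here is exactly the open question below; the sketch only shows where the reduction slots into the pipeline.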
[Figure: phylogenetic tree of several virus sequences]
This works surprisingly well: the output looks similar to a tree obtained by multiple sequence alignment with ClustalX, but the computation takes seconds rather than hours.
Question: How can I measure the quality of these trees, in order to choose the optimal word length k, the optimal number of components q (if PCA is used), and the distance and clustering methods? Preferably without lengthy bootstraps over random sequences ;-).