If for n nucleic acid sequences a table of word frequencies (the sequence ATG corresponds to two words of length 2, AT and TG) is constructed, then that table can be used (directly or after dimensionality-reduction by PCA) to calculate a distance matrix of these sequences, which can then be clustered into a phylogenetic tree (doi:10.1007/s00285-002-0185-3):
library(seqinr)
Bat1 <- read.fasta(file="bat1.FASTA")
Bat1.seq <- Bat1[[1]]
Bat1.count <- as.vector(count(Bat1.seq, 2)) # count word (k-mer) frequencies; rule of thumb: k < log4(sequence length)
...
Counts <- rbind(Bat1.count, ...)
rownames(Counts) <- c("Bat1", ...)
colnames(Counts) <- names(count(Bat1.seq, 2)) # count() returns a named 1-d table, so use names()
RowCounts <- rowSums(Counts)
Counts.norm <- Counts/RowCounts # normalise word counts for different sequence length
distance <- dist(Counts.norm, method = "euclidean")
hc <- hclust(distance, method = "average")
plot(hc)
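For the PCA variant mentioned above, one can project the normalised count matrix onto the first q principal components before computing distances. A minimal sketch (the Counts.norm matrix here is simulated stand-in data, and q = 3 is an arbitrary illustrative choice; in the real pipeline Counts.norm comes from the word-frequency code above):

```r
# Simulated stand-in for the normalised word-count matrix (5 sequences, 16 dinucleotides)
set.seed(1)
Counts.norm <- matrix(runif(5 * 16), nrow = 5,
                      dimnames = list(paste0("Seq", 1:5), NULL))
Counts.norm <- Counts.norm / rowSums(Counts.norm)

pca <- prcomp(Counts.norm, center = TRUE)   # PCA on the row-normalised counts
q <- 3                                      # number of components to keep (illustrative)
Scores <- pca$x[, 1:q, drop = FALSE]        # per-sequence scores in PC space

distance.pca <- dist(Scores, method = "euclidean")
hc.pca <- hclust(distance.pca, method = "average")
plot(hc.pca)
```

Choosing q here is exactly the open question below; the sketch only shows where the reduction slots into the pipeline.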
[Figure: phylogenetic tree of several virus sequences]
This works surprisingly well: the output looks similar to a tree obtained by multiple sequence alignment with ClustalX, but the computation takes seconds rather than hours.
Question: How can I measure the quality of these trees, in order to choose the optimal word length k, the optimal number of components q (if PCA is used), and the distance and clustering methods? Preferably without lengthy bootstraps over random sequences ;-).