2

I have two lists of complex structures (each list is a multiPhylo object containing phylogenetic trees), and I would like to find out how many times each element of the first one appears in the second one. Pretty straightforward, but for some reasons my code returns wrong results.

library(devtools)
install_github('santiagosnchez/rBt')
library(rBt)

beast_output <- read.annot.beast('strict_BD_starBEAST_logcomb.species.trees')
beast_output_rooted <- root(beast_output, c('taxon_A', 'taxon_B')) # length == 20,000
unique_multiphylo <- unique.multiPhylo(beast_output_rooted) # length == 130

count <- function(item, list) {
  total = 0
  for (i in 1:length(list)) {
    if (all.equal.phylo(item, list[[i]])) {
      total = total + 1
    }
  }
  return(total)
}

result <- data.frame(un_tr = rep(0, 130), count = rep(0, 130))
for (i in 1:length(unique_multiphylo)) {
  result[i, ] <- c(i, count(unique_multiphylo[[i]], beast_output_rooted))
}

The function all.equal.phylo() takes two phylo objects and returns TRUE if they are the same. See docs. The function count() takes an item and a list and returns the number of times the item appears in the list using all.equal.phylo().

The issue is that the function count() returns 0 most of the time. This should not be possible as the list unique_multiphylo is a sublist of beast_output_rooted, which means that count() should at least return 1.

What is wrong with my code? And how can I correct it? Many thanks for the help!

EDIT: here is a reproducible example:

install.packages('ape')
library(ape)

set.seed(42)
trees <- lapply(rep(c(10, 25, 50, 100), 3), rtree) # length == 12
class(trees) <- 'multiPhylo'
unique_multiphylo <- unique.multiPhylo(trees) # length == 12

count <- function(item, list) {
  total = 0
  for (i in 1:length(list)) {
    if (all.equal.phylo(item, list[[i]])) {
      total = total + 1
    }
  }
  return(total)
}

result <- data.frame(un_tr = rep(0, 12), count = rep(0, 12))
for (i in 1:length(unique_multiphylo)) {
  result[i, ] <- c(i, count(unique_multiphylo[[i]], trees))
}

However, it seems to be working perfectly fine with these simulated data...

jvddorpe
  • 315
  • 1
  • 2
  • 10
  • 2
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Try to make the question self sufficient so all relevant parts can by copy/pasted. Does your problem require the `rBt` library? – MrFlick Feb 04 '19 at 20:22
  • Unfortunately, I cannot share my data as they have not been published yet, but I will try to find and share a similar dataset first thing tomorrow morning! I use the `rBt` library to import my BEAST output: `read.annot.beast()`. – jvddorpe Feb 04 '19 at 20:29
  • @MrFlick, I updated my question. – jvddorpe Feb 05 '19 at 09:09
  • If the provided data doesn't reproduce your problem we cannot test the solutions for your problem with this data... Are you sure that by rooting the phylogenetic tree remains constant? – llrs Feb 05 '19 at 15:56

2 Answers2

2

I finally managed to get proper results. In the function all.equal.phylo(), I needed to set the parameter use.edge.length to FALSE so that only the topologies of the phylogenetic trees are compared.

Here is my code:

(I changed the names of a couple of variables to make it clearer what I was trying to do)

install.packages('devtools')
library(devtools)
install_github('santiagosnchez/rBt')
library(rBt)

beast_output <- read.annot.beast('beast_output.trees')
beast_output_rooted <- root.multiPhylo(beast_output, c('taxon_A', 'taxon_B'))
unique_topologies <- unique.multiPhylo(beast_output_rooted)

count <- function(item, list) {
  total = 0
  for (i in 1:length(list)) {
    if (all.equal.phylo(item, list[[i]], use.edge.length = FALSE)) {
      total = total + 1
    }
  }
  return(total)
}

result <- data.frame(unique_topology = rep(0, length(unique_topologies)),
                     count = rep(0, length(unique_topologies)))
for (i in 1:length(unique_topologies)) {
  result[i, ] <- c(i, count(unique_topologies[[i]], beast_output_rooted))
}

result$percentage <- ((result$count/length(beast_output_rooted))*100)
jvddorpe
  • 315
  • 1
  • 2
  • 10
1

There is a shorter solution to your problem:

table( attr(unique_multiphylo, "old.index") )

as unique_multiphylo contains an attribute with the information you are after (see ?unique.multiPhylo).

klash
  • 341
  • 1
  • 2