combine OTU and tax table and replace actual sequences with OTU ids (Phyloseq/dada2)

Question

I was following the workflow described here https://f1000research.com/articles/5-1492/v2 using the sample data as well as my own data. This worked fine, but now I can't generate an OTU table, that contains a header such as "OTU00004" or even better "kingdom_phylum_..._Pseudomonas_OTU00004". I would like to use such a table to find and plot the abundance of a certain OTU over several samples.

I created an object called ps, that seems to be ok:

ps <- phyloseq(tax_table(taxtab), sample_data(samdf),
                 otu_table(seqtab, taxa_are_rows = FALSE),phy_tree(fitGTR$tree))    

> ps
    phyloseq-class experiment-level object
    otu_table()   OTU Table:         [ 454 taxa and 360 samples ]
    sample_data() Sample Data:       [ 360 samples by 14 sample variables ]
    tax_table()   Taxonomy Table:    [ 454 taxa by 6 taxonomic ranks ]
    phy_tree()    Phylogenetic Tree: [ 454 tips and 452 internal nodes ]

but the headers in the OTU table and the corresponding rows in the taxonomy table are actual (here shortened) sequences

> head(otu_table(ps)[1])
     GCAAGCGTTACTCGGAATCACTGGGCGTAAAGAGCGCGTAGGCGG#shortened
F3D0                                             0

> head(tax_table(ps)[1])
Taxonomy Table:     [1 taxa by 6 taxonomic ranks]:
                                                         Kingdom
GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGA#shortened "Bacteria"

Is there a way to combine the information from otu table and taxonomy table and replace the sequences with numbered OTU ids? I have checked several phyloseq resources and FAQs, but I can't find an answer to this.

I would like to have a table looking like this:

        taxonomy_OTU00001   taxonomy_OTU00002   taxonomy_OTU00003
F3D0    #counts             #counts             #counts
F3D1    #counts             #counts             #counts
F3D11   #counts             #counts             #counts
F3D125  #counts             #counts             #counts

As the workflow until this step is quite time consuming, I'm not sure how to provide a reproducible example for this problem.

EDIT: I generated a sample subset following dww's suggestion.

short_otu2 = short_otu = head(otu_table(ps)[,c(1:6)])  # seq as colnames 
short_tax2 = short_tax = tax_table(ps)[colnames(short_otu), ]  # seq as rownames
# shorten seqs, must still be unique
colnames(short_otu2) <- substr(colnames(short_otu), 0, 50)
rownames(short_tax2) <- substr(rownames(short_tax), 0, 50)

library(phyloseq)
> dput(short_otu2)
new("otu_table", .Data = structure(c(526L, 375L, 2931L, 994L,
2061L, 419L, 319L, 330L, 1737L, 623L, 1868L, 350L, 402L, 207L,
1880L, 577L, 887L, 303L, 413L, 64L, 838L, 698L, 939L, 484L, 146L,
126L, 496L, 440L, 1183L, 184L, 462L, 37L, 26L, 782L, 271L, 310L
), .Dim = c(6L, 6L), .Dimnames = list(c("F3D0", "F3D1", "F3D11",
"F3D125", "F3D13", "F3D141"), c("GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGAT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCT", "GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTT", "CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGC"))), taxa_are_rows = FALSE)

> dput(short_tax2)
new("taxonomyTable", .Data = structure(c("Bacteria", "Bacteria",
"Bacteria", "Bacteria", "Bacteria", "Bacteria", "Bacteroidetes",
"Bacteroidetes", "Bacteroidetes", "Bacteroidetes", "Bacteroidetes",
"Bacteroidetes", "Bacteroidia", "Bacteroidia", "Bacteroidia",
"Bacteroidia", "Bacteroidia", "Bacteroidia", "Bacteroidales",
"Bacteroidales", "Bacteroidales", "Bacteroidales", "Bacteroidales",
"Bacteroidales", "Bacteroidales_S24-7_group", "Bacteroidales_S24-7_group",
"Bacteroidales_S24-7_group", "Bacteroidales_S24-7_group", "Bacteroidaceae",
"Bacteroidales_S24-7_group", NA, NA, NA, NA, "Bacteroides", NA
), .Dim = c(6L, 6L), .Dimnames = list(c("GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGAAGAT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGCAGGCGGACTCT", "GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGGCTTT", "CCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGATTGT",
"GCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCCTGC"), c("Kingdom",
"Phylum", "Class", "Order", "Family", "Genus"))))

I think a simple merge operation is needed, but to verify we need sample data. You should subset your tables to make some reproducible data for this Q. Assuming that the gene sequences are the row names in each table, try: `short_otu = head(otu_table(ps))`, and then for a corresponding tax table, do `short_tax = tax_table(ps)[rownames(short_otu), ]`. Yous could then shorten the row names using `substr` and add to the qeustion using `dput` — dww, Oct 28 '18 at 14:49
Hi - I had a quick look, but can't seem to get the format you want. There don't seem t be any sample data, to attach the "sample_100", "sample_101"... names. Maybe someone else with more familiarty of phyloseq can help better — dww, Oct 29 '18 at 12:48
@dww Ah I'm sorry, "sample_100", etc are the actual sample names for my own data. in the provided example it is "F3D0", "F3D1", "F3D11", "F3D125", "F3D13", "F3D141" etc — crazysantaclaus, Oct 29 '18 at 12:59

crazysantaclaus · Accepted Answer · 2018-11-06T16:50:29.890

This is to the part of the question "replace actual sequences with OTU ids (Phyloseq/dada2)?"

I contacted the phyloseq/dada2 developers and based on Susan Holmes' reply (https://github.com/joey711/phyloseq/issues/1030) I came up with this piece of code to replace the amplicon sequences with a numbered OTU header.

Further discussion can be found here: https://github.com/joey711/phyloseq/issues/213

# this changes the header from the actual sequence to Seq_001, Seq_002 etc
taxa_names(ps)
n_seqs <- seq(ntaxa(ps))
len_n_seqs <- nchar(max(n_seqs))
taxa_names(ps) <- paste("Seq", formatC(n_seqs, 
                                            width = len_n_seqs, 
                                            flag = "0"), sep = "_")
taxa_names(ps)

A possible way to get taxonomy included into the header is the following (continuing from above):

# generate a vector containing the full taxonomy path for all OTUs
wholetax <- do.call(paste, c(as.data.frame(tax_table(ps))
                  [c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus")], 
                  sep = "__"))  # to distinguish from "_" within tax ranks

# turn the otu_table into a data.frame
otu_export <- as.data.frame(otu_table(ps))
tmp <- names(otu_export)

# paste wholetax and OTU_ids together
for(i in 1:length(tmp)){
names(tmp)[i] = paste(wholetax[i], tmp[i], sep = "__")
}

# overwrite old names
names(otu_export) <- names(tmp)

> head(otu_export)[5]

# output:  
     Bacteria__Bacteroidetes__Bacteroidia__Bacteroidales__Bacteroidaceae__Bacteroides__Seq_005
F3D0                                                                                         146
F3D1                                                                                         126
F3D11                                                                                        496
F3D125                                                                                       440
F3D13                                                                                       1183
F3D141                                                                                       184

This yet does not include a test of correct sorting between the tables! So make sure the pasting and overwriting is correct.

This way you have a data.frame containing taxonomy "splittable" for each taxonomic rank, the OTU id, the sample name and the counts in one file. But apart from the export file, you still maintain the phyloseq-structure, where the OTU_ids link the different tables such as otu_table() and tax_table(). Another way would be to provide the wholetax vector to the taxa_names() command, I haven't tested that.

Suggestions for improvements are highly welcomed!

combine OTU and tax table and replace actual sequences with OTU ids (Phyloseq/dada2)

EDIT: I generated a sample subset following dww's suggestion.

1 Answers1