R calculate most abundant taxa using phyloseq object

Question

I would like to know if my approach to calculate the average of the relative abundance of any taxon is correct !!!

If I want to know if, to calculate the relative abundance (percent) of each family (or any Taxon) in a phyloseq object (GlobalPattern) will be correct like:

    data("GlobalPatterns")

T <- GlobalPatterns %>% 
    tax_glom(., "Family") %>% 
    transform_sample_counts(function(x)100* x / sum(x)) %>% psmelt() %>% 
    arrange(OTU) %>% rename(OTUsID = OTU) %>% 
    select(OTUsID, Family, Sample, Abundance) %>%
    spread(Sample, Abundance)

T$Mean <- rowMeans(T[, c(3:ncol(T))])

FAM <- T[, c("Family", "Mean" ) ]

#order data frame  
FAM <- FAM[order(dplyr::desc(FAM$Mean)),]
rownames(FAM) <- NULL

head(FAM)
          Family          Mean
1    Bacteroidaceae     7.490944
2   Ruminococcaceae     6.038956
3   Lachnospiraceae     5.758200
4 Flavobacteriaceae     5.016402
5  Desulfobulbaceae     3.341026
6            ACK-M1     3.242808

in this case the Bacteroidaceae were the most abundant family in all the samples of GlobalPattern (26 samples and 19216 OTUs), it was present in 7.49% in average in 26 samples !!!! It’s correct to make the T$Mean <- rowMeans(T[, c(3:ncol(T))]) to calculate the average any given Taxon ?

score 1 · Answer 1 · answered Jun 10 '22 at 08:15

Bacteroidaceae has the highest abundance, if all samples were pooled together. However, it has the highest abundance in only 2 samples. Nevertheless, there is no other taxon having a higher abundance in an average sample.

Let's use dplyr verbs for all the steps to have a more descriptive and consistent code:

library(tidyverse)
library(phyloseq)
#> Creating a generic function for 'nrow' from package 'base' in package 'biomformat'
#> Creating a generic function for 'ncol' from package 'base' in package 'biomformat'
#> Creating a generic function for 'rownames' from package 'base' in package 'biomformat'
#> Creating a generic function for 'colnames' from package 'base' in package 'biomformat'
data(GlobalPatterns)

data <-
  GlobalPatterns %>%
  tax_glom("Family") %>%
  transform_sample_counts(function(x)100* x / sum(x)) %>%
  psmelt() %>%
  as_tibble()

# highest abundance: all samples pooled together
data %>%
  group_by(Family) %>%
  summarise(Abundance = mean(Abundance)) %>%
  arrange(-Abundance)
#> # A tibble: 334 × 2
#>    Family             Abundance
#>    <chr>                  <dbl>
#>  1 Bacteroidaceae          7.49
#>  2 Ruminococcaceae         6.04
#>  3 Lachnospiraceae         5.76
#>  4 Flavobacteriaceae       5.02
#>  5 Desulfobulbaceae        3.34
#>  6 ACK-M1                  3.24
#>  7 Streptococcaceae        2.77
#>  8 Nostocaceae             2.62
#>  9 Enterobacteriaceae      2.55
#> 10 Spartobacteriaceae      2.45
#> # … with 324 more rows

# sanity check: is total abundance of each sample 100%?
data %>%
  group_by(Sample) %>%
  summarise(Abundance = sum(Abundance)) %>%
  pull(Abundance) %>%
  `==`(100) %>%
  all()
#> [1] TRUE

# get most abundant family for each sample individually
data %>%
  group_by(Sample) %>%
  arrange(-Abundance) %>%
  slice(1) %>%
  select(Family) %>%
  ungroup() %>%
  count(Family, name = "n_samples") %>%
  arrange(-n_samples)
#> Adding missing grouping variables: `Sample`
#> # A tibble: 18 × 2
#>    Family              n_samples
#>    <chr>                   <int>
#>  1 Desulfobulbaceae            3
#>  2 Bacteroidaceae              2
#>  3 Crenotrichaceae             2
#>  4 Flavobacteriaceae           2
#>  5 Lachnospiraceae             2
#>  6 Ruminococcaceae             2
#>  7 Streptococcaceae            2
#>  8 ACK-M1                      1
#>  9 Enterobacteriaceae          1
#> 10 Moraxellaceae               1
#> 11 Neisseriaceae               1
#> 12 Nostocaceae                 1
#> 13 Solibacteraceae             1
#> 14 Spartobacteriaceae          1
#> 15 Sphingomonadaceae           1
#> 16 Synechococcaceae            1
#> 17 Veillonellaceae             1
#> 18 Verrucomicrobiaceae         1

^{Created on 2022-06-10 by the reprex package (v2.0.0)}

R calculate most abundant taxa using phyloseq object

1 Answers1