0

Working in R on genomic data.

I'm trying to subset a very large melted phyloseq table, which includes a column of phylum IDs, in order to remove rows containing phyla that occur less than 100000 times in the table. I might have missed an "easy" way to do this, but I eventually ended up trying to make my own function.

The function:

phylum_subset <- function(x = melt.ALKSS_few, #melted physeq object 
                          Count = melt.ALKSS$Phylum, #counting phyla
                          Value = 1000   #minimum number of OTUs
                          ){
  phyla.table <- table(x$Count) 
  
  for(Count in x){if(phyla.table[Count]<=100000)
                        subset(x,Phylum != Count)
  }
}

I will grant that this is my first time writing a function and I don't really know what I'm doing.

My function input and resulting error output ends up like so:

melt.ALKSS_few.count <- phylum_subset(x = melt.ALKSS_few,Count = melt.ALKSS_few$Phylum,Value = 100000)
Error in if (phyla.table[Count] <= 1e+05) subset(x, Count != Phylum_col) : 
  the condition has length > 1

Because I'm trying to subset by a sum of occurences in a column, across all occurences in that column, I couldn't just use filter() or something once (unless I wanted to do that 500 times). Surely someone has done something like this before?

Edit: OK, trying to provide a reproducible chunk of my dataset. Be warned, it's got over 808k obs of 47 variables because doing genomics on an ecological dataset is a mess. I've removed some variables that are remnants of metadata for previous steps (primer sequences, etc.) that I won't be using in analysis just to keep the code block... less massive.

> dput(droplevels(head(melt.ALKSS_few)))
structure(list(OTU = c("44c21e29adae97a53247abbd73978395", "0f18144d308ada95632ab5193d92073f", 
"d829bee4984f82ffc2453212157caf96", "0f18144d308ada95632ab5193d92073f", 
"0ddcd311e02f742e2e0e61ce02cf9c29", "120eba657e42a11a5c29f97b90f02035"
), Sample = c("S438", "S680", "S437", "S345", "S454", "S513"), 
    Abundance = c(10755, 9568, 8186, 7621, 7506, 7501), BarcodeSequence = c("CATTTTAGGACT", 
    "CGGAATAGAGTA", "CATTTTAGAGTA", "TATAATGGACCA", "CGGAATTGGCAT", 
    "GACGACGGACCA"), PrimerDesc = c("16S", 
    "16S", "16S", "16S", "16S", "16S"), SampleName = c("06222021KC-2-R", 
    "09292021KC-2-R", "06222021KC-1-R", "06032021KC-1-R", "06292921KC-3-R", 
    "06302021KC-3-R"), Project = c("16SLBSKR1-", "16SLBSKR2-", 
    "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR1-", "16SLBSKR2-"), 
    Number = c("456", "694", "455", "363", "471", "491"), Date = c("6_22_2021", "9_29_2021", "6_22_2021", "6_3_2021", 
    "6_29_2021", "6_30_2021"), Year = c(2021L, 2021L, 2021L, 
    2021L, 2021L, 2021L), Season = c("Summer", "Fall", "Summer", 
    "Summer", "Summer", "Summer"), sample_Species = c("Little_Bluestem", 
    "Little_Bluestem", "Little_Bluestem", "Little_Bluestem", 
    "Little_Bluestem", "Little_Bluestem"), SoloOrMixed = c("Solo", 
    "Mixed", "Solo", "Mixed", "Mixed", "Solo"), Location = c("Tyler_SP", "Hy_180", "Tyler_SP", "Roadside_Hy67", 
    "Copper_Breaks_SP", "Caprock_Canyons_SP"), Ecoregion = c("South_Central_Plains", 
    "South_Central_Plains", "South_Central_Plains", "Edwards_Plateau", 
    "Southwestern_Tablelands", "Southwestern_Tablelands"), Habitat = c("Forest", 
    "Roadside", "Forest", "Roadside", "Roadside", "AridRock"), 
    Source = c("Root", "Root", "Root", "Root", "Root", "Root"
    ),  PrecipMonth = c(96.65, 
    37.45, 96.65, 125.94, 125.01, 153.94), PrecipDaysSince = c(1L, 
    1L, 1L, 1L, 1L, 0L), pH = c(6.8, 6.7, 6.8, 8, 8, 7.8), EC = c(139L, 
    182L, 139L, 161L, 125L, 2370L), NO3 = c(0, 4.4, 0, 0.2, 2.2, 
    3.4), P = c(16L, 17L, 16L, 14L, 5L, 6L), K = c(145L, 84L, 
    145L, 114L, 160L, 65L), Ca = c(3918L, 2159L, 3918L, 27256L, 
    6609L, 16508L), Mg = c(166L, 130L, 166L, 188L, 148L, 95L), 
    S = c(10L, 16L, 10L, 24L, 24L, 14299L), Na = c(4L, 3L, 4L, 
    4L, 4L, 4L), Fe = c(19.76, 17, 19.76, 2.31, 1, 0), Zn = c(2.28, 
    15.1, 2.28, 7.01, 0.8, 0.1), Mn = c(64.16, 19, 64.16, 27.01, 
    15, 6), Cu = c(0.16, 0.2, 0.16, 0.16, 0.2, 0.2), Kingdom = c("d__Bacteria", 
    "d__Bacteria", "d__Bacteria", "d__Bacteria", "d__Bacteria", 
    "d__Bacteria"), Phylum = c("Proteobacteria", "Proteobacteria", 
    "Proteobacteria", "Proteobacteria", "Proteobacteria", "Actinobacteriota"
    ), Class = c("Gammaproteobacteria", "Gammaproteobacteria", 
    "Alphaproteobacteria", "Gammaproteobacteria", "Gammaproteobacteria", 
    "Actinobacteria"), Order = c("Xanthomonadales", "Pseudomonadales", 
    "Rhizobiales", "Pseudomonadales", "Pseudomonadales", "Streptomycetales"
    ), Family = c("Rhodanobacteraceae", "Pseudomonadaceae", "Xanthobacteraceae", 
    "Pseudomonadaceae", "Pseudomonadaceae", "Streptomycetaceae"
    ), Genus = c("Rhodanobacter", "Pseudomonas", "Bradyrhizobium", 
    "Pseudomonas", "Pseudomonas", "Streptomyces"), Species = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), row.names = c(2352002L, 511171L, 7348565L, 
510815L, 468295L, 621043L), class = "data.frame")
  • It would help if you [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format, to show what `melt.ALKSS_few` looks like. I suspect that you can use the dplyr functions `group_by`, `mutate` and `filter` to achieve the result, but it's difficult to help without seeing some data. – neilfws Aug 08 '22 at 23:48
  • @neilfws - Made the edit you suggested, I think. Obviously wasn't able to include enough to give you a set that will allow to filter by phyla with 100k instances. By random chance, the head of my data seems to have been Proteobacteria-heavy. – carrollk2015 Aug 09 '22 at 04:36
  • What is `melt.ALKSS` in `Count = melt.ALKSS$Phylum, #counting phyla`? – Armali Aug 09 '22 at 06:55
  • 1
    I ended up creating a table - `table(melt.ALKSS_few$Phylum)` - to look at number of occurrences and subsetting - `subset(melt.ALKSS_few, Phylum != "[Insert Phylum Here]")` - one phylum at a time to get it done. Thankfully ended up being less than thirty phyla. Still interested in if there's a few-line solution, but that was my fix. – carrollk2015 Aug 09 '22 at 07:09
  • @Armali - That's an older version of `melt.ALKSS_few` without the samples with unassigned phylum (`NA`) removed. I'm not sure it makes a difference there, since you see I call the correct datatset in the function call itself in the second code block. – carrollk2015 Aug 09 '22 at 07:12

1 Answers1

0

See below for a tidyverse solution. Note that I've used GlobalPatterns from phyloseq to create a reproducible example.

require("phyloseq")
require("tidyverse")

# Load the data and melt it
data(GlobalPatterns)
psdf <- psmelt(GlobalPatterns)

# Function to subset a dataframe based on the size of each group
# in a grouping variable
subset_by_freq <- function(df, grouping_var, threshold){
  df %>%
    group_by(!!sym(grouping_var)) %>%
    filter(n() >= threshold) %>%
    ungroup()
}

# Filter out taxa with less than 1e5 counts
psdf_sub <- subset_by_freq(psdf, "Phylum", 1e5)

# Sanity check: count the number of rows per taxon
psdf_sub %>%
  group_by(Phylum) %>%
  tally()
#> # A tibble: 2 x 2
#>   Phylum              n
#>   <chr>           <int>
#> 1 Firmicutes     113256
#> 2 Proteobacteria 166816

Created on 2022-08-12 by the reprex package (v2.0.1)

gmt
  • 325
  • 1
  • 7