I'm using phyloseq
a lot for my work. My data sets often contain multiple conditions or parameters, which need to be analyzed in the same way (for example the same plot for Bacteria in Summer or Winter AND in Lake1 or Lake2), so I wanted to use functions for that. I wrote a subset function which allows me to combine multiple parameters by looping over them. The output is stored in lists for further analysis.
However, this seems quite clumsy. So my first question would regard the improvement of the function.
1) Specifically, I wonder if the
a) use of multiple for loops
to generate the subsets is a good idea.
b) Also, the combination of for loops
and lapply
could be optimized. And
c) maybe there is a better way to prevent existing lists from being unrecognizedly appended with a new iteration of the same objects again? I implemented this as I had many, many test executions while developing the code
Whether for loops are slower than apply in general was discussed here: lapply vs for loop - Performance R
I think phyloseq
internally calls which
, so it does not have to be a phyloseq
specific solution.
2) My second question would be how to handle the case, if not all search parameters are present in all subsets? So in the example below, the combination of "danish" and "M" would break if there were no danish males. I would like to avoid that and in this example just have 3 (danish x F, american x F, american x M) instead of 4 subsets. At the moment the function needs to be adapted to every special subset which kind of ruins the purpose of writing it in the first place.
library(phyloseq)
data(enterotype)
# reduce the size of the data set
phyloseq <- filter_taxa(enterotype, function (x) {sum(x > 0.001) >= 1}, prune = TRUE)
# arguments for the subsetting function
phyloseq_object <- phyloseq
Nationality <- c("american", "danish")
Gender <- c("F", "M")
# define a function to obtain sample subsets from the phyloseq object
# per combination of parameters
get_sample_subsets <- function(phyloseq_object, nation, gender) {
sample_subset <- sample_data(phyloseq_object)[ which(sample_data(phyloseq_object)$Nationality == nation &
sample_data(phyloseq_object)$Gender == gender),]
phyloseq_subset <- merge_phyloseq(tax_table(phyloseq_object),
otu_table(phyloseq_object),
#refseq(phyloseq_object),
sample_subset)
phyloseq_subset2 <- filter_taxa(phyloseq_subset, function (x) {sum(x > 0) >= 1 }, prune = TRUE)
return(phyloseq_subset2)
}
# here we pass the arguments for subsetting over two for loops
# to create all possible combinations of the subset parameters etc.
# the subsets are stored within a list, which has to be empty before running the loops
sample_subset_list <- list()
if(length(sample_subset_list) == 0) {
for (nations in Nationality) {
for (gender in Gender) {
tmp <- get_sample_subsets(phyloseq_object = phyloseq_object,
nation = nations, gender = gender)
sample_subset_list[[paste(nations, gender, sep = "_")]] <- tmp
}
}
print(sample_subset_list)
} else {
print("list is not empty, abort to prevent appending...")
}
# You could now for example use the output to calculate ordinations for each subset (this data set has too few entries per subset for that)
# create a list where the distance metrics for the sample subsets are stored
ordination_nmds <- list()
ordination_nmds <- lapply(sample_subset_list, ordinate, method = "NMDS",
dist = "bray", try = 100, autotransform = TRUE)