3

I'm using phyloseq a lot for my work. My data sets often contain multiple conditions or parameters, which need to be analyzed in the same way (for example the same plot for Bacteria in Summer or Winter AND in Lake1 or Lake2), so I wanted to use functions for that. I wrote a subset function which allows me to combine multiple parameters by looping over them. The output is stored in lists for further analysis.

However, this seems quite clumsy. So my first question would regard the improvement of the function.

1) Specifically, I wonder if the

a) use of multiple for loops to generate the subsets is a good idea.

b) Also, the combination of for loops and lapply could be optimized. And

c) maybe there is a better way to prevent existing lists from being unrecognizedly appended with a new iteration of the same objects again? I implemented this as I had many, many test executions while developing the code

Whether for loops are slower than apply in general was discussed here: lapply vs for loop - Performance R

I think phyloseq internally calls which, so it does not have to be a phyloseq specific solution.

2) My second question would be how to handle the case, if not all search parameters are present in all subsets? So in the example below, the combination of "danish" and "M" would break if there were no danish males. I would like to avoid that and in this example just have 3 (danish x F, american x F, american x M) instead of 4 subsets. At the moment the function needs to be adapted to every special subset which kind of ruins the purpose of writing it in the first place.

library(phyloseq)
data(enterotype)
# reduce the size of the data set
phyloseq <- filter_taxa(enterotype, function (x) {sum(x > 0.001) >= 1}, prune = TRUE)

# arguments for the subsetting function
phyloseq_object <- phyloseq
Nationality <- c("american", "danish")
Gender <- c("F", "M")

# define a function to obtain sample subsets from the phyloseq object 
# per combination of parameters
get_sample_subsets <- function(phyloseq_object, nation, gender) {
  sample_subset <- sample_data(phyloseq_object)[ which(sample_data(phyloseq_object)$Nationality == nation &
    sample_data(phyloseq_object)$Gender == gender),]
  phyloseq_subset <- merge_phyloseq(tax_table(phyloseq_object),
    otu_table(phyloseq_object),
    #refseq(phyloseq_object),
    sample_subset)
  phyloseq_subset2 <- filter_taxa(phyloseq_subset, function (x) {sum(x > 0) >= 1 }, prune = TRUE)
  return(phyloseq_subset2)
}

# here we pass the arguments for subsetting over two for loops
# to create all possible combinations of the subset parameters etc.
# the subsets are stored within a list, which has to be empty before running the loops 
sample_subset_list <- list()
if(length(sample_subset_list) == 0) {
  for (nations in Nationality) {
    for (gender in Gender) {
      tmp <- get_sample_subsets(phyloseq_object = phyloseq_object,
        nation = nations, gender = gender)
      sample_subset_list[[paste(nations, gender, sep = "_")]] <- tmp
    }
  }
  print(sample_subset_list)
} else {
  print("list is not empty, abort to prevent appending...")
}

# You could now for example use the output to calculate ordinations for each subset (this data set has too few entries per subset for that)

# create a list where the distance metrics for the sample subsets are stored
ordination_nmds <- list()
ordination_nmds <- lapply(sample_subset_list, ordinate, method = "NMDS",
  dist = "bray", try = 100, autotransform = TRUE)
crazysantaclaus
  • 613
  • 5
  • 19
  • Could you be a bit more specific in the first part of your question? Which parts are you exactly looking at for better ideas? The latter seems somewhat simple, but it is not clear from your question what your first question entails. – Oliver Nov 14 '19 at 10:23
  • Have you tried to use `split()` function to get the expected list with subsets? https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/split. – Paul Nov 14 '19 at 10:47
  • I think you are looking for a combine use of `split()` and `lapply()`. The 1st allows you to easily create subsets and store them in a list. The `lapply()` can replace your `for()` loop. This makes your code more robust, easy to debug and faster (in my opinion). – Paul Nov 14 '19 at 10:53
  • @Oliver: I tried to make my with the function issues more clear. Let me know if that helps you – crazysantaclaus Nov 14 '19 at 12:03

1 Answers1

0

Works for S3 but not for S4 (see comments)

As I am not familiar with S4 I might delete this answer if something better comes out.

Based on my comment here is something that might help you. Please let me know if you need a better solution or if it does not answer your problem.

# I changed the data because "phyloseq" package require further install
    ex_data = mtcars

# this line might replace your "get_sample_subsets" function and your loop to check if they are empty lists
# You can modify the elements inside list(...) to get the wanted subsets, it is very flexible
    sampled_data = split(ex_data, list(ex_data$cyl, ex_data$vs), drop = TRUE) # note the drop = TRUE, to avoid "empty" elements
Paul
  • 2,850
  • 1
  • 12
  • 37
  • the `split()` approach looks very promising, but the objects of class `phyloseq` are more complicated to access, do you know how to adjust for that? I think `phyloseq` should be available for R 3.6 using `source('http://bioconductor.org/biocLite.R') biocLite('phyloseq')`? – crazysantaclaus Nov 14 '19 at 12:41
  • 1
    @Paul This is an`S4 object` [http://adv-r.had.co.nz/OO-essentials.html#s4] . Here is a relevant answer: [https://stackoverflow.com/questions/10961842/how-to-define-the-subset-operators-for-a-s4-class/10961998#10961998] – markhogue Nov 14 '19 at 13:02
  • 1
    Specifically for phyloseq, [https://joey711.github.io/phyloseq/preprocess.html] – markhogue Nov 14 '19 at 13:09
  • My bad, did not know about this S4 object problem, I will try to find a better solution with the edited question and your comments :) – Paul Nov 14 '19 at 13:18