I'm building a little function in R that takes size measurements from several species and several sites, combines all the data by site (lumping many species together), and then computes some statistics on those combined data.
Here is some simplified sample data:
SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
SpeciesID <- c("CHIL", "CHIP", "GAM", "NZMS", "LUMB", "CHIL", "SIMA", "CHIP", "CHIL", "NZMS")
# Counts of individuals in each of 20 size classes (B1-B20), one row per site/species
Counts <- data.frame(matrix(sample(0:99, 200, replace = TRUE), nrow = 10, ncol = 20))
colnames(Counts) <- paste0('B', 1:20)
spec <- cbind(SiteID, SpeciesID, Counts)
# Results skeleton: one row per site, Mean to be filled in by the function
stat1 <- data.frame(unique(SiteID))
colnames(stat1) <- 'SiteID'
stat1$Mean <- NA
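For reference, a quick structure check of the two objects (illustrative only; output omitted):
dim(spec)   # 10 rows x 22 columns: SiteID, SpeciesID, B1-B20
stat1       # one row per unique SiteID, with Mean still NA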
Here is the function, which creates a list, lsize1, where each list element is a vector of the sizes (B1 to B20) for a given SpeciesID in a given SiteID, with each size class repeated according to its count. From this, the function creates a list, lsize2, which combines the elements of lsize1 that share the same SiteID. Finally, it takes the mean of each element of lsize2 (i.e., the average size of an individual for each SiteID, regardless of SpeciesID) and outputs that as the result.
fsize <- function(){
  specB <- spec[, 3:22]
  # One vector per row: size classes 1:20, each repeated by its count
  lsize1 <- apply(specB, 1, function(x) rep(1:20, x))
  names(lsize1) <- spec$SiteID
  # Combine all vectors that share a SiteID (the slow step)
  lsize2 <- sapply(unique(names(lsize1)), function(x) unlist(lsize1[names(lsize1) == x], use.names = FALSE), simplify = FALSE)
  # Mean size per site, written into the results table
  stat1[stat1$SiteID %in% names(lsize2), 'Mean'] <- round(sapply(lsize2, mean), 2)
  return(stat1)
}
In creating this function, I followed the suggestion here: combine list elements based on element names, which gets at the crux of my problem: combining list elements that share some criterion (in my case, combining all elements from the same SiteID). The function works as intended, but is there a way to make it substantially faster?
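For reference, the grouping step can also be written with split(); this is just a sketch against the sample data above (lsize1 is built exactly as inside fsize(), and lsize2_alt is only an illustrative name), not a claim that it's faster:
lsize1 <- apply(spec[, 3:22], 1, function(x) rep(1:20, x))
# split() groups the list elements by SiteID (ordered by factor level, not by appearance)
lsize2_alt <- lapply(split(lsize1, spec$SiteID), unlist, use.names = FALSE)
round(sapply(lsize2_alt, mean), 2)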
Note: for my actual data set, which is ~40,000 rows long, the function runs in ~0.7 seconds, with the most time-consuming step being the creation of lsize2 (~0.5 seconds). I need to run this function many, many times, with different permutations and subsets of the data, so I'm hoping there's a way to cut this processing time down significantly.
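For reference, timings of this sort can be reproduced with something like the following (system.time() is base R; microbenchmark, if installed, gives more stable numbers):
# Rough timing of a single call (the toy data above is far smaller than the real set)
system.time(fsize())
# More stable timings, if the microbenchmark package is available:
# microbenchmark::microbenchmark(fsize(), times = 100)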