
I'm building a little function in R that takes size measurements from several species and several sites, combines all the data by site (lumping many species together), and then computes some statistics on those combined data.

Here is some simplistic sample data:

SiteID <- rep(c("D00002", "D00003", "D00004"), c(5, 2, 3))
SpeciesID <- c("CHIL", "CHIP", "GAM", "NZMS", "LUMB", "CHIL", "SIMA", "CHIP", "CHIL", "NZMS")
Counts <- data.frame(matrix(sample(0:99, 200, replace = TRUE), nrow = 10, ncol = 20))  # 20 size classes per row
colnames(Counts) <- paste0('B', 1:20)
spec <- cbind(SiteID, SpeciesID, Counts)

stat1 <- data.frame(unique(SiteID))  # results table: one row per site
colnames(stat1) <- 'SiteID'
stat1$Mean <- NA

Here is the function. It first creates a list, lsize1, where each element is a vector of sizes for a given SpeciesID at a given SiteID: the size-class index (1 to 20, corresponding to columns B1 to B20) repeated according to its count. From this, the function creates a second list, lsize2, which combines the elements of lsize1 that share the same SiteID. Finally, it takes the mean of each element of lsize2 (i.e., the average size of an individual at each SiteID, regardless of SpeciesID) and returns that as the result.

fsize <- function(){
    specB <- spec[, 3:22]  # the count columns B1-B20
    # one vector per row: size-class index repeated by its count
    lsize1 <- apply(specB, 1, function(x) rep(1:20, x))
    names(lsize1) <- spec$SiteID
    # combine all vectors belonging to the same SiteID
    lsize2 <- sapply(unique(names(lsize1)), function(x) unlist(lsize1[names(lsize1) == x], use.names = FALSE), simplify = FALSE)
    # mean size per site, written into the results table
    stat1[stat1$SiteID %in% names(lsize2), 'Mean'] <- round(sapply(lsize2, mean), 2)
    return(stat1)
}

In creating this function, I followed the suggestion here: combine list elements based on element names, which gets at the crux of my problem: combining list elements that share some criterion (in my case, all elements from the same SiteID). The function works as intended, but my question is whether there's a way to make it substantially faster.

Note: for my actual data set, which is ~40,000 rows long, the function runs in ~0.7 seconds, with the most time-consuming step being the creation of lsize2 (~0.5 seconds). I need to run this function many, many times, with different permutations and subsets of the data, so I'm hoping there's a way to cut this processing time down significantly.
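For reference, a minimal way to time it (system.time() is base R; the microbenchmark package is an optional extra, not used above):

# rough timing of the current approach
system.time(fsize())

# for more stable estimates over repeated runs
# (assumes the 'microbenchmark' package is installed):
# library(microbenchmark)
# microbenchmark(fsize(), times = 20)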

cgjeff

1 Answer

There shouldn't be any need for loops here. Here's one attempt:

specB <- spec[, 3:22]  # the count columns B1-B20
tmp <- data.frame(spec["SiteID"], sums = rowSums(specB * col(specB)), counts = rowSums(specB))
tmp <- aggregate(. ~ SiteID, tmp, sum)
tmp$avg <- tmp$sums / tmp$counts
tmp

#  SiteID  sums counts      avg
#1 D00002 46254   4549 10.16795
#2 D00003 20327   1810 11.23039
#3 D00004 29651   2889 10.26341

Compare:

fsize()
#  SiteID  Mean
#1 D00002 10.17
#2 D00003 11.23
#3 D00004 10.26

This code essentially multiplies each value by its index (col(specB)), then aggregates the sums and counts by SiteID. The same logic should be relatively transferable to other methods (data.table/dplyr) as well, e.g. in data.table:

library(data.table)
setDT(spec)   # convert spec to a data.table by reference
spec[, .(avg = sum(.SD * col(.SD)) / sum(unlist(.SD))), by = SiteID, .SDcols = B1:B20]

#   SiteID      avg
#1: D00002 10.16795
#2: D00003 11.23039
#3: D00004 10.26341
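Sketching the same logic in dplyr (assuming the dplyr and tidyr packages; pivot the count columns to long form, then take a weighted mean per site):

library(dplyr)
library(tidyr)

spec %>%
  pivot_longer(B1:B20, names_to = "class", values_to = "count") %>%  # one row per site/species/size class
  mutate(size = as.integer(sub("^B", "", class))) %>%                # size-class index 1-20
  group_by(SiteID) %>%
  summarise(avg = sum(size * count) / sum(count))                    # weighted mean size per site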
thelatemail
  • Thanks. That works well for the mean. I'm also hoping to do something similar for other statistics too though, such as the median or standard deviation (in hindsight, I should have included those in my example). Any thoughts on how those might be calculated in a similar fashion? – cgjeff Nov 27 '17 at 23:54
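One way the sums-and-counts idea could stretch to those statistics: keep a per-site sum of squares for the standard deviation, and read the median off the per-site size-class counts. A sketch, assuming spec is still built from the sample data above (none of this is from the original answer):

specB <- as.data.frame(spec)[, paste0("B", 1:20)]   # count columns (works whether or not setDT() was run)
idx <- col(as.matrix(specB))                         # size-class index (1-20) for every cell

tmp <- data.frame(SiteID = spec$SiteID,
                  n  = rowSums(specB),               # individuals per row
                  s1 = rowSums(specB * idx),         # sum of sizes
                  s2 = rowSums(specB * idx^2))       # sum of squared sizes
tmp <- aggregate(. ~ SiteID, tmp, sum)

tmp$mean <- tmp$s1 / tmp$n
tmp$sd   <- sqrt((tmp$s2 - tmp$n * tmp$mean^2) / (tmp$n - 1))   # same as sd(rep(1:20, counts))

# median: expand only the 20 size classes per site, which stays cheap
site_counts <- rowsum(as.matrix(specB), as.character(spec$SiteID))
med <- apply(site_counts, 1, function(cnt) median(rep(1:20, cnt)))
tmp$median <- med[as.character(tmp$SiteID)]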