I have some fish catch data. Each row contains a species name, a catch value (cpue), and some other unrelated identifying fields (year, location, depth, etc.). This code will produce a dataset with the correct structure:

# a sample dataset
set.seed(1337)
fish = rbind(
  data.frame(
    spp = "Flounder",
    cpue = rnorm(5, 5, 2)
  ),
  data.frame(
    spp = "Bass",
    cpue = rnorm(5, 15, 1)
  ),
  data.frame(
    spp = "Cod",
    cpue = rnorm(5, 2, 4)
  )
)

I'm trying to create a normalized cpue column, cpue_norm. To do this, I apply the following formula to each cpue value:

cpue_norm = (cpue - cpue_mean)/cpue_std

where cpue_mean and cpue_std are, respectively, the mean and standard deviation of cpue. The caveat is that I need to do this per species, i.e., when I calculate cpue_norm for a particular row, cpue_mean and cpue_std must be computed from the cpue values of that row's species only.

The trouble is that all of the species are in the same dataset. So for each row, I need to calculate the mean and standard deviation of cpue for that species and then use those values to calculate cpue_norm.

I've been able to make some headway with tapply:

calc_cpue_norm = function(l) {
  return((l - mean(l))/sd(l))
}

tapply(fish$cpue, fish$spp, calc_cpue_norm)

but I end up with a list of per-species vectors, when what I need is to add these values to the corresponding data frame rows.

Anyone who knows R better than me have some wisdom to share?

Canadian_Marine
  • Question closed as a duplicate. Use this: `fish <- fish %>% dplyr::group_by(spp) %>% dplyr::mutate(cpue_norm = (cpue - mean(cpue, na.rm = TRUE)) / sd(cpue, na.rm = TRUE))` – Claudiu Papasteri Feb 22 '21 at 16:07
  • data.table: `as.data.table(fish)[, cpue_norm := (cpue - mean(cpue))/sd(cpue), by = .(spp)]`; base R: `ave(fish$cpue, fish$spp, FUN = function(z) (z-mean(z))/sd(z))`; and you have dplyr from above. – r2evans Feb 22 '21 at 16:10
  • Works perfect. Thanks! – Canadian_Marine Feb 22 '21 at 16:38
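
For reference, here is a minimal sketch of the base R `ave()` suggestion from the comments, applied to the sample data from the question. It assumes nothing beyond base R. The sanity check at the end is an addition for illustration and not part of the original comments. `ave()` returns a vector in the same row order as its input, so the result can be assigned straight to a new column.

```r
# Sample dataset from the question
set.seed(1337)
fish <- rbind(
  data.frame(spp = "Flounder", cpue = rnorm(5, 5, 2)),
  data.frame(spp = "Bass",     cpue = rnorm(5, 15, 1)),
  data.frame(spp = "Cod",      cpue = rnorm(5, 2, 4))
)

# ave() applies FUN within each spp group and returns one value per
# original row, in the original row order
fish$cpue_norm <- ave(
  fish$cpue, fish$spp,
  FUN = function(z) (z - mean(z)) / sd(z)
)

# Illustrative sanity check: within each species, the normalized values
# should have mean approximately 0 and standard deviation approximately 1
group_means <- tapply(fish$cpue_norm, fish$spp, mean)
group_sds   <- tapply(fish$cpue_norm, fish$spp, sd)
print(round(group_means, 10))
print(round(group_sds, 10))
```

Unlike `tapply()`, which collapses the data into one element per species, `ave()` keeps one value per row, which is why it fits the "add a column" use case. Note that `sd()` returns `NA` for a species with only one row, so such rows would get `NA` in `cpue_norm`.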
