I have some fish catch data. Each row contains a species name, a catch value (cpue), and some other unrelated identifying fields (year, location, depth, etc.). This code produces a dataset with the correct structure:
# a sample dataset
set.seed(1337)
fish = rbind(
  data.frame(
    spp = "Flounder",
    cpue = rnorm(5, 5, 2)
  ),
  data.frame(
    spp = "Bass",
    cpue = rnorm(5, 15, 1)
  ),
  data.frame(
    spp = "Cod",
    cpue = rnorm(5, 2, 4)
  )
)
I'm trying to create a normalized cpue column, cpue_norm. To do this, I apply the following function to each cpue value:
cpue_norm = (cpue - cpue_mean)/cpue_std
Where cpue_mean and cpue_std are, respectively, the mean and standard deviation of cpue. The caveat is that I need to do this by species, i.e. when I calculate the cpue_norm for a particular row, I need to calculate the cpue_mean and cpue_std using cpue from only that species.
The trouble is that all of the species are in the same dataset. So for each row, I need to calculate the mean and standard deviation of cpue for that species and then use those values to calculate cpue_norm.
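To make the target concrete, here is a rough sketch of the calculation done by hand for a single species. It rebuilds the sample data so it runs on its own; bass and bass_norm are names made up purely for this illustration.

```r
# rebuild the sample dataset so this block runs on its own
set.seed(1337)
fish = rbind(
  data.frame(spp = "Flounder", cpue = rnorm(5, 5, 2)),
  data.frame(spp = "Bass", cpue = rnorm(5, 15, 1)),
  data.frame(spp = "Cod", cpue = rnorm(5, 2, 4))
)

# hypothetical names: pull out the cpue values for one species only
bass = fish$cpue[fish$spp == "Bass"]

# normalize using the mean and sd of that species alone
bass_norm = (bass - mean(bass)) / sd(bass)

# sanity check: normalized values should have mean ~0 and sd ~1
mean(bass_norm)
sd(bass_norm)
```

What I'm after is this same result for every species, written back into the matching rows of fish as cpue_norm.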
I've been able to make some headway with tapply:
calc_cpue_norm = function(l) {
  return((l - mean(l))/sd(l))
}
tapply(fish$cpue, fish$spp, calc_cpue_norm)
but this returns a list of per-species vectors, when what I need is to add these values to the matching rows of the data frame as a new column.
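For reference, a small sketch of what the tapply result looks like. The data are rebuilt so the block runs on its own, and res is a name introduced just for this illustration.

```r
# rebuild the sample dataset so this block runs on its own
set.seed(1337)
fish = rbind(
  data.frame(spp = "Flounder", cpue = rnorm(5, 5, 2)),
  data.frame(spp = "Bass", cpue = rnorm(5, 15, 1)),
  data.frame(spp = "Cod", cpue = rnorm(5, 2, 4))
)

calc_cpue_norm = function(l) {
  return((l - mean(l))/sd(l))
}

# hypothetical name: keep the tapply result so its shape can be inspected
res = tapply(fish$cpue, fish$spp, calc_cpue_norm)

is.list(res)   # the result is list-like, not a plain numeric vector
names(res)     # one entry per species
lengths(res)   # each entry holds the normalized values for that species
```

So there is one vector of values per species, keyed by species name, rather than a single vector of 15 values in the original row order that could be assigned straight to fish$cpue_norm.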
Anyone who knows R better than me have some wisdom to share?