1

I want to estimate predicted values for the mean (mu) and standard deviation (sigma) based on a gamlss model. However, it is not clear to me how to extract a standard deviation for given values of x

The data frame I am using looks like this:

#> head(abdom)
#  y     x
# 59 12.29
# 64 12.29
# 56 12.29

Here is the code to fit a gamlss model:

library(gamlss)
fit = gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)

I want to calculate z-scores based on this model using the following approach: z = (y - mu)/sigma . Therefore, I use this code to calculate mu and sigma for each value of y and calculate the z scores. 95% of the z scores should lie between -2 and 2.

using predict function

mu = predict(fit, newdata = abdom, type = "response", what = "mu")
si = predict(fit, newdata = abdom, type = "response", what = "sigma")
z_score1 = (abdom$y - mu) / si
hist(z_score1)

using centiles.pred function

z_score2 = centiles.pred(fit, xname = "x", xvalues = abdom$x, yval = abdom$y, type = "z-scores")
hist(z_score2)

This leads to the following plots:

enter image description here

for z_score1, most scores are not even close to lie between -2 and 2.

Another way to approach this is by plotting the mean and standard deviation:

# calculating mu +/- 2*sigma
pred_dat = data.frame(x = 10:45)
mu = predict(fit, newdata = pred_dat, type = "response", what = "mu")
si = predict(fit, newdata = pred_dat, type = "response", what = "sigma")
hi = mu + (2 * si)
lo = mu - (2 * si)

pred_dat$mu = mu
pred_dat$hi = hi
pred_dat$lo = lo

# plotting
ggplot(data = pred_dat, aes(x = x)) +
  geom_point(data = abdom, aes(x = x, y = y)) +
  geom_line(aes(y = mu), colour = "red") +
  geom_line(aes(y = hi), colour = "blue") +
  geom_line(aes(y = lo), colour = "blue")

yielding the following plot:

enter image description here

Again, 95% of the values should lie between the two blue lines (hi and lo). But the values of the standard deviations are so low that there seems to be only one line.

So my questions are: first question: what do the values derived from predict represent if not the standard deviation conditional to x?

second question: how can I predict the standard deviation for a given x-value?

ehi
  • 409
  • 8
  • 23
  • I'm upvoting your question, because I consider it to be very well presented. But at the same time, I'm voting to move it to [Cross Validated](https://stats.stackexchange.com), as most likely it belongs there and not SO. – PavoDive Jan 02 '23 at 19:37
  • I found that in bcpe the scale-parameter sigma is not the standard deviation but the centile-based coefficient of variance, that is calculated based on interquartile range devided by the median. However, my question remains: how can I get the standard deviation for a given value of x? – ehi Jan 03 '23 at 15:58

2 Answers2

1

For BCPE the z-scores are not (y-mu)/sigma.

For any gamlss fit, the z-scores are exactly equal to the residuals of the fitted model, i.e. for your model fit

resid(fit)

or

fit$residuals

Robert
  • 186
  • 2
  • Thanks for your answer. In my example, however, this is almost true but there are slight differences between ```resid(fit1)``` and ```centiles.pred(fit, xname = "x", xvalues = abdom$x, yval = abdom$y, type = "z-scores")```. – ehi Jan 05 '23 at 16:08
  • But: my goal is to create a nomogram with three curves: the expected values (mu) and two lines representing a measure of dispersion - preferably the standard deviation. Threrefore, my question is: how can I get a measure of dispersion for a given value of x? – ehi Jan 05 '23 at 16:14
1

The gamlss package provides distribution functions for the BCPE distribution, including qBCPE. If you plug the coefficients from your model into this function at pnorm(1), then you will get the predicted value of y at 1 standard deviation above the predicted mean. Since you can get the predicted mean with predict(fit), then you can easily get the standard deviation. The difficult part is getting the parameters from your model into qBCPE. Here's a reprex:

library(gamlss)
library(ggplot2)

fit <- gamlss(y ~ cs(x), sigma.formula = ~ cs(x), data = abdom, family = BCPE)

Q <- qBCPE(pnorm(1), 
           mu    = predict(fit),
           sigma = exp(fit$sigma.coefficients[1] + 
                   fit$sigma.coefficients[2] * cs(abdom$x)),
           nu =    fit$nu.coefficients, 
           tau =   exp(fit$tau.coefficients)) 

SD <- c(Q - predict(fit))

Here, SD gives the vector of standard deviations at each value of x:

head(SD)
#> [1] 4.092467 4.092467 4.092467 4.203738 4.425361 4.425361

To show this is correct, let's plot 1.96 standard deviations on either side of the prediction line:

ggplot(data = data.frame(x = abdom$x, y = predict(fit),
                         upper = predict(fit) + 1.96 * SD,
                         lower = predict(fit) - 1.96 * SD), aes(x, y)) +
  geom_point(data = abdom) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  geom_line(color = "blue", linewidth = 1)

This looks good. Let's confirm that about 5% of observations lie outside 1.96 standard deviations of the mean:

(sum(abdom$y > predict(fit) + 1.96 * SD) +
  sum(abdom$y < predict(fit) - 1.96 * SD)) / nrow(abdom)
#> [1] 0.0557377

And let's show that the calculated Z scores follow a standard normal distribution:

Z <- (abdom$y - predict(fit))/SD
hist(Z, breaks = 20, freq = FALSE)
lines(seq(-4, 4, 0.1), dnorm(seq(-4, 4, 0.1)))

enter image description here

This looks pretty good.

Created on 2023-01-09 with reprex v2.0.2

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87