9

Background

Right now, I'm creating a multiple-predictor linear model and generating diagnostic plots to assess regression assumptions. (It's for a multiple regression analysis stats class that I'm loving at the moment :-)

My textbook (Cohen, Cohen, West, and Aiken 2003) recommends plotting each predictor against the residuals to make sure that:

  1. The residuals don't systematically covary with the predictor
  2. The residuals are homoscedastic with respect to each predictor in the model

On point (2), my textbook has this to say:

Some statistical packages allow the analyst to plot lowess fit lines at the mean of the residuals (0-line), 1 standard deviation above the mean, and 1 standard deviation below the mean of the residuals....In the present case {their example}, the two lines {mean + 1sd and mean - 1sd} remain roughly parallel to the lowess {0} line, consistent with the interpretation that the variance of the residuals does not change as a function of X. (p. 131)

How can I modify loess lines?

I know how to generate a scatterplot with a "0-line,":

    # First, I'll make a simple linear model and get its diagnostic stats
    library(ggplot2)
    data(cars)
    mod <- fortify(lm(speed ~ dist, data = cars))
    attach(mod)
    str(mod)

    # Now I want to make sure the residuals are homoscedastic
    qplot (x = dist, y = .resid, data = mod) + 
    geom_smooth(se = FALSE) # "se = FALSE" Removes the standard error bands

But does anyone know how I can use ggplot2 and qplot to generate plots where the 0-line, "mean + 1sd" AND "mean - 1sd" lines would be superimposed? Is that a weird/complex question to be asking?

briandk
  • 6,749
  • 8
  • 36
  • 46
  • For what it's worth, I'm not wedded to ggplot2. I've just found it to be a really intuitive and powerful package for data display, particularly since I'm an R novice :-) – briandk Mar 28 '10 at 00:05
  • I'm not sure what you want. Isn't that just a 68% confidence interval? I was always taught to plot the absolute residuals and a loess. It's a simpler check for varying variance. – hadley Mar 28 '10 at 04:05
  • Hadley - My book gives an example of homoscedasticity, and two examples of heteroscedasticity: http://picasaweb.google.com/brian.danielak/HomoscedasticityPlots?feat=directlink In the second and third photos, the 0-line for loess doesn't wiggle much at all, but the +1sd and -1sd lines bring out the pattern in the residuals. Granted, the patterns are visible _without_ any loess, but what if they were less apparent in the plot? I can't tell if my problem is at the coding level or the conceptual stats level. Given the data in those pictures, how would you think about assessing homoscedasticity? – briandk Mar 28 '10 at 05:22
  • Correct me if I'm wrong Hadley, but I believe that the standard errors for loess are computed assuming homoskedasicity (via residual.scale). Briandk, this doesn't answer your question, but I would use something like: qplot (x = dist, y = abs(.resid), data = mod) + geom_smooth(), and if the line is not flat, use hccm. – Ian Fellows Mar 28 '10 at 06:30
  • ... I wish I had known about fortify() 2 hours ago. – Matt Parker Mar 29 '10 at 19:34
  • @Matt - The good news is you know about it now! – briandk Mar 30 '10 at 02:58

3 Answers3

4

Apology

Folks, I want to apologize for my ignorance. Hadley is absolutely right, and the answer was right in front of me all along. As I suspected, my question was born of statistical, rather than programmatic ignorance.

We get the 68% Confidence Interval for Free

geom_smooth() defaults to loess smoothing, and it superimposes the +1sd and -1sd lines as part of the deal. That's what Hadley meant when he said "Isn't that just a 68% confidence interval?" I just completely forgot that's what the 68% interval is, and kept searching for something that I already knew how to do. It didn't help that I'd actually turned the confidence intervals off in my code by specifying geom_smooth(se = FALSE).

What my Sample Code Should Have Looked Like

# First, I'll make a simple linear model and get its diagnostic stats.
library(ggplot2)
data(cars)
mod <- fortify(lm(speed ~ dist, data = cars))
attach(mod)
str(mod)

# Now I want to make sure the residuals are homoscedastic.
# By default, geom_smooth is loess and includes the 68% standard error bands.
qplot (x = dist, y = .resid, data = mod) + 
geom_abline(slope = 0, intercept = 0) +
geom_smooth() 

What I've Learned

Hadley implemented a really beautiful and simple way to get what I'd wanted all along. But because I was focused on loess lines, I lost sight of the fact that the 68% confidence interval was bounded by the very lines I needed. Sorry for the trouble, everyone.

briandk
  • 6,749
  • 8
  • 36
  • 46
1

Could you calculate the +/- standard deviation values from the data and add a fitted curve of them to the plot?

djq
  • 14,810
  • 45
  • 122
  • 157
  • Great suggestion, though I don't think it would quite work. Cohen et al (2003) add this: "The other two lines {mean + 1sd and mean - 1sd} are created using the lowess procedure to estimate values 1 standard deviation above and 1 standard deviation below the lowess line." Since the sd calculations (it seems) are effectively point-wise computations in the lowess algorithm, I'm not sure your suggestion would accurately produce what we want. If what you mean is to calculate the sd of the dataset, then add/subtract that from the lowess estimates, that would create exact copies of the curve :-( – briandk Mar 28 '10 at 02:15
  • Ah I see... Thanks for explaining. One other good source of ggplot2 advice is ggplot2@googlegroups.com, so if you don't resolve your question here, that is an option (also the ggplot2 tag exists) – djq Mar 28 '10 at 02:20
1

Have a look at my question "modify lm or loess function.."

I am not sure I followed your question very well, but maybe a:

+ stat_smooth(method=yourfunction)

will work, provided that you define your function as described here.

Community
  • 1
  • 1
dalloliogm
  • 8,718
  • 6
  • 45
  • 55
  • @dalloliogm - Thanks for the suggestion! It turns out (see below) that I don't even need to specify a custom `yourfunction`, because the default confidence bands for `geom_smooth()` are bounded by precisely the lines I was trying to plot all along. – briandk Mar 30 '10 at 03:20