11

I'm looking for a more convenient way to get a Q-Q plot in ggplot2 where the quantiles are computed for the data set as a whole. but I can use mappings (colour/shapes) for groups in the data.

library(dplyr)
library(ggplot2)
library(broom) ## for augment()

Make up some data:

set.seed(1001)
N <- 1000
G <- 10
dd <- data_frame(x=runif(N),
             f=factor(sample(1:G,size=N,replace=TRUE)),
             y=rnorm(N)+2*x+as.numeric(f))
m1 <- lm(y~x,data=dd)
dda <- cbind(augment(m1),f=dd$f)

Basic plot:

ggplot(dda)+stat_qq(aes(sample=.resid))

enter image description here

if I try to add colour, the groups get separated for the quantile computation (which I don't want):

ggplot(dda)+stat_qq(aes(sample=y,colour=f))

enter image description here

If I use stat_qq(aes(sample=y,colour=f,group=1)) ggplot ignores the colour specification and I get the first plot back.

I want a plot where the points are positioned as in the first case, but coloured as in the second case. I have a qqnorm-based manual solution that I can post but am looking for something nicer ...

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • I don't think the data set you used in this question was generated with the correct seed. Note where your second and third largest sample points are relative to the answer, which is also what I got when running your code. Can you fix this? I found it confusing and initially thought the answer given below was incorrect. – DaSpeeg Jun 20 '18 at 14:57
  • better now .... ? – Ben Bolker Jun 20 '18 at 15:46

2 Answers2

6

You could calculate the quantiles yourself and then plot using geom_point:

dda = cbind(dda, setNames(qqnorm(dda$.resid, plot.it=FALSE), c("Theoretical", "Sample")))

ggplot(dda) + 
  geom_point(aes(x=Theoretical, y=Sample, colour=f))

enter image description here

Ah, I guess I should have read to the end of your question. This is the manual solution you were referring to, right? Although you could just package it as a function:

my_stat_qq = function(data, colour.var) {

  data=cbind(data, setNames(qqnorm(data$.resid, plot.it=FALSE), c("Theoretical", "Sample")))

  ggplot(data) + 
    geom_point(aes_string(x="Theoretical", y="Sample", colour=colour.var))

}

my_stat_qq(dda, "f")
eipi10
  • 91,525
  • 24
  • 209
  • 285
3

Here's a ggproto-based approach that attempts to change StatQq, since the underlying issue here (colour specification gets ignored when group is specified explicitly) is due to how its compute_group function is coded.

  1. Define alternate version of StatQq with modified compute_group (last few lines of code):
StatQq2 <- ggproto("StatQq", Stat,
                  default_aes = aes(y = after_stat(sample), x = after_stat(theoretical)),
                  
                  required_aes = c("sample"),
                  
                  compute_group = function(data, scales, quantiles = NULL,
                                           distribution = stats::qnorm, dparams = list(),
                                           na.rm = FALSE) {

                    sample <- sort(data$sample)
                    n <- length(sample)
                    
                    # Compute theoretical quantiles
                    if (is.null(quantiles)) {
                      quantiles <- stats::ppoints(n)
                    } else if (length(quantiles) != n) {
                      abort("length of quantiles must match length of data")
                    }
                    
                    theoretical <- do.call(distribution, c(list(p = quote(quantiles)), dparams))
                    
                    res <- ggplot2:::new_data_frame(list(sample = sample, 
                                                         theoretical = theoretical))
                    
                    # NEW: append remaining columns from original data 
                    # (e.g. if there were other aesthetic variables),
                    # instead of returning res directly
                    data.new <- subset(data[rank(data$sample), ], 
                                       select = -c(sample, PANEL, group))
                    if(ncol(data.new) > 0) res <- cbind(res, data.new)

                    res
                  }
)
  1. Define geom_qq2 / stat_qq2 to use modified StatQq2 instead of StatQq for their stat:
geom_qq2 <- function (mapping = NULL, data = NULL, geom = "point", 
                      position = "identity", ..., distribution = stats::qnorm, 
                      dparams = list(), na.rm = FALSE, show.legend = NA, 
                      inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = StatQq2, geom = geom, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list(distribution = distribution, dparams = dparams, 
                      na.rm = na.rm, ...))
}

stat_qq2 <- function (mapping = NULL, data = NULL, geom = "point", 
                      position = "identity", ..., distribution = stats::qnorm, 
                      dparams = list(), na.rm = FALSE, show.legend = NA, 
                      inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = StatQq2, geom = geom, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list(distribution = distribution, dparams = dparams, 
                      na.rm = na.rm, ...))
}

Usage:

cowplot::plot_grid(
  ggplot(dda) + stat_qq(aes(sample = .resid)),        # original
  ggplot(dda) + stat_qq2(aes(sample = .resid,         # new
                             color = f, group = 1))
)

result

Z.Lin
  • 28,055
  • 6
  • 54
  • 94
  • Although I feel very awkward to question one of your answers, I am not sure if this is resulting in the desired output - I think it should in the end look more like in [this answer](https://stackoverflow.com/a/42679222/7941188) ? – tjebo Jun 23 '20 at 11:26