1

I am trying to create a correlation matrix between my X and Y variables and display this information in a nice figure. I am currently using ggpairs() from the GGally package, but if there's a better way to do this then I am happy to try something new. The figure should:

-Fit linear regression models (using lm) between X and Y variables

-Display scatterplots with a regression line

-Display the Coefficient of the determination (R2)

-Map the colour of points/lines/R2 values by group

I have been able to do most of this, but ggpairs only displays the correlation coefficient (r) and not the coefficient of determination (R2). I was able to use the suggestion from this post, but unfortunately the solution does not display R2 values by group.

So far:

library(GGally)
library(ggplot2)

cars <- mtcars
cars$group <- factor(c(rep("A", 16), rep("B", 16))) #adding grouping variable

#function to return R2 (coefficient of determination) and not just r (Coefficient of correlation) in the top portion of the figure
upper_fn <- function(data, mapping, ndp=2, ...){
  
  # Extract the relevant columns as data
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # Calculate the r^2 & format output
  m <- summary(lm(y ~ x))
  lbl <- paste("r^2: ", formatC(m$r.squared, digits=ndp, format="f"))
  
  # Write out label which is centered at x&y position
  ggplot(data=data, mapping=mapping) + 
    annotate("text", x=mean(x, na.rm=TRUE), y=mean(y, na.rm=TRUE), label=lbl, parse=TRUE, ...)+
    theme(panel.grid = element_blank()) 
}


#lower function basically fits a linear model and displays points 
lower_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point(alpha = 0.7) + 
    geom_smooth(method=lm, fill="blue", se = F, ...)
  p
}

#The actual figure
  ggpairs(cars,
    columns = c(1:11),
    mapping = ggplot2::aes(color = group),
    upper = list(continuous = "cor", size = 15),
    diag = list(continuous = "densityDiag", alpha=0.5),
    lower = list(continuous = lower_fn))
Axeman
  • 32,068
  • 8
  • 81
  • 94
seak23
  • 195
  • 9
  • yes, you will need to change `upper_fn`: split the data according to the group, run `lm` over each group and grab the r^2's from each. https://stackoverflow.com/questions/68459268/is-it-possible-to-split-correlation-box-to-show-correlation-values-for-two-diffe/68475964#68475964 gives an example of how to split (but the example is a lot more complicated than you need here so oyu can ignore most of the code) – user20650 Feb 11 '22 at 21:00
  • Thanks for your reply! I'm still struggling with this. Based on your link, do you mean I should use ```unlist()``` and ```lapply()``` ? – seak23 Feb 12 '22 at 18:11
  • yes. from a glance using the code in the `!is.null(col)` condition, you will likely only need the first three lines (the others were for colouring and sizing the rectangles). But change the `cor` stuff in the `lapply` to grab the r^2 instead – user20650 Feb 12 '22 at 20:25
  • I'm still struggling with this. Do you mean something like ```unlist(lapply(idx, function(i) summary(lm(y[i] ~ x[i], method=method, use=use))))``` ? Do I need to specify ```method``` in the function call early on? – seak23 Feb 15 '22 at 16:05

1 Answers1

1

Based on Is it possible to split correlation box to show correlation values for two different treatments in pairplot?, below is a little code to get you started.

The idea is that you need to 1. split the data over the aesthetic variable (which is assumed to be colour), 2. run a regression over each data subset and extract the r^2, 3. quick calculation of where to place the r^2 labels, 4. plot. Some features are left to do.

upper_fn <- function(data, mapping, ndp=2, ...){
  
  # Extract the relevant columns as data
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  col <- eval_data_col(data, mapping$colour)

  # if no colour mapping run over full data
  if(is.null(col)) {
        ## add something here
    }

  # if colour aesthetic, split data and run `lm` over each group
  if(!is.null(col)) {
    idx <- split(seq_len(nrow(data)), col)
    r2 <- unlist(lapply(idx, function(i) summary(lm(y[i] ~ x[i]))$r.squared))

    lvs <- if(is.character(col)) sort(unique(col)) else levels(col)
    cuts <- seq(min(y, na.rm=TRUE), max(y, na.rm=TRUE), length=length(idx)+1L)
    pos <- (head(cuts, -1) + tail(cuts, -1))/2 

    p <- ggplot(data=data, mapping=mapping, ...) + 
            geom_blank() + 
            theme_void() + 
            # you could map colours to each level 
            annotate("text", x=mean(x), y=pos, label=paste(lvs, ": ", formatC(r2, digits=ndp, format="f")))
    }
  
  return(p)
}
user20650
  • 24,654
  • 5
  • 56
  • 91