1

I have a large dataset of gene expression from ~10,000 patient samples (TCGA), and I'm plotting a predicted expression value (x) and the actual observed value (y) of a certain gene signature. For my downstream analysis, I need to draw a precise line through the plot and calculate different parameters in samples above/below the line.
No matter how I draw a line through the data (geom_smooth(method = 'lm', 'glm', 'gam', or 'loess')), the line always seems imperfect - it doesn't cut through the data to my liking (red line is lm in figure).
After playing around for a while, I realized that the 2d kernel density lines (geom_density2d) actually do a good job of showing the slope/trends of my data, so I manually drew a line that kind of cuts through the density lines (black line in figure).

My question: how can I automatically draw a line that cuts through the kernel density lines, as for the black line in the figure? (Rather than manually playing with different intercepts and slopes till something looks good).

The best approach I can think of is to somehow calculate intercept and slope of the longest diameter for each of the kernel lines, take an average of all those intercepts and slopes and plot that line, but that's a bit out of my league. Maybe someone here has experience with this and can help?

A more hacky approach may be getting the x,y coords of each kernel density line from ggplot_build, and going from there, but it feels too hacky (and is also out of my league).

Thanks!

EDIT: Changed a few details to make the figure/analysis easier. (Density lines are smoother now). Reprex:

library(MASS)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1]  # standard normal (mu=0, sd=1)
y <- data[, 2]  # standard normal (mu=0, sd=1)

test.df <- data.frame(x = x, y = y)
lm(y ~ x, test.df)

ggplot(test.df, aes(x, y)) +
  geom_point(color = 'grey') +
  geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) + ### EDIT: h = c(2,2)
  geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
  geom_abline(intercept = 0, slope = 0.7, lwd = 1, col = 'black') ## EDIT: slope to 0.7

Figure: enter image description here

tjebo
  • 21,977
  • 7
  • 58
  • 94
user3579613
  • 113
  • 11
  • 1
    Corect, need the black line automatically. Red line is not needed, I only added it here as an example. – user3579613 Jul 06 '18 at 18:22
  • 4
    OK, so at some point I have to ask - what makes you think that the manually drawn black line divides the data more accurately than the algorithmically selected line? That's an assumption that has to be validated first. Then, if validated, it's a modeling problem where you find a better fit based on the evidence used to validate the earlier hypothesis. – Hack-R Jul 06 '18 at 18:55
  • 1
    You're right, it's entirely possible that the lm line is 'statistically better' than my manually drawn line, but the lm line messes up my downstream analysis. My focus here is to draw a good line for splitting data into higher/lower than expected observed values, and you can see the lm line fails near the extremes of the line, while my black line splits the data better overall. That messes up my downstream analysis. See here an example of what I want to do automatically: https://imgur.com/a/mPLyBJv. – user3579613 Jul 06 '18 at 20:04

2 Answers2

3

I generally agree with @Hack-R.
However, it was kind of a fun problem and looking into ggplot_build is not such a big deal.

require(dplyr)
require(ggplot2)

p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) 
#basic version of your plot

p_built <- ggplot_build(p)

p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),] %>%
  select(x,y) # extracts the x/y coordinates of the points on the largest ellipse from your 2d-density contour

Now this answer helped me to find the points on this ellipse which are furthest apart.

coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))

p_maxring <- p_maxring %>% 
  mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points

coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point

farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])



 ggplot(test.df, aes(x, y)) +
  geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) +
  # geom_segment using the coordinates of the points farthest apart 
  geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
                    xend = coord_fff['x'], yend = coord_fff['y']))) +
  geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
# as per your request with your geom_smooth line

  coord_equal()

coord_equal is super important, because otherwise you will get super weird results - it messed up my brain too. Because if the coordinates are not set equal, the line will seemingly not pass through the point furthest apart from the mean...

I leave it to you to build this into a function in order to automate it. Also, I'll leave it to you to calculate the y-intercept and slope from the two points

enter image description here

tjebo
  • 21,977
  • 7
  • 58
  • 94
  • Looks good, thanks a lot! I'll take a look on Monday and see how it works for me. Could you add an 'lm' line to your plot to see how they compare? – user3579613 Jul 08 '18 at 16:09
  • 1
    @user3579613 voilà ! – tjebo Jul 08 '18 at 18:03
  • Thanks for your answer, but I found that your approach (measuring the mean of all points, and finding the two most distant points), has a flaw: it measures the longest distance between two **points**. While that is generally OK, what I'm really after is the longest axis of the ellipse/contours, as sometimes the irregular shape of an ellipse can give confusing results. I will post this in a new answer. – user3579613 Jul 10 '18 at 18:42
2

Tjebo's approach was kind of good initially, but after a close look, I found that it found the longest distance between two points on an ellipse. While this is close to what I wanted, it failed with either an irregular shape of the ellipse, or the sparsity of points in the ellipse. This is because it measured the longest distance between two points; whereas what I really wanted is the longest diameter of an ellipse; i.e.: the semi-major axis. See image below for examples/details.

Briefly:

To find/draw density contours of specific density/percentage:
R - How to find points within specific Contour

To get the longest diameter ("semi-major axis") of an ellipse: https://stackoverflow.com/a/18278767/3579613

For function that returns intercept and slope (as in OP), see last piece of code.
The two pieces of code and images below compare two Tjebo's approach vs. my new approach based on the above posts.

#### Reprex from OP
require(dplyr)
require(ggplot2)
require(MASS)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1]  # standard normal (mu=0, sd=1)
y <- data[, 2]  # standard normal (mu=0, sd=1)
test.df <- data.frame(x = x, y = y)

#### From Tjebo
p <- ggplot(test.df, aes(x, y)) +
  geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 2) 
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>% 
  mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points
p_maxring = p_maxring[round(seq(1, nrow(p_maxring), nrow(p_maxring)/23)),] #### Make a small ellipse to illustrate flaws of approach
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
farthest_2_points = data.frame(t(cbind(coord_farthest, coord_fff)))
plot(p_maxring[,1:2], asp=1)
lines(farthest_2_points, col = 'blue', lwd = 2)


#### From answer in another post
d = cbind(p_maxring[,1], p_maxring[,2])
r = ellipsoidhull(d)
exy = predict(r) ## the ellipsoid boundary
lines(exy)
me = colMeans((exy))           
dist2center = sqrt(rowSums((t(t(exy)-me))^2))
max(dist2center)     ## major axis
lines(exy[dist2center == max(dist2center),], col = 'red', lwd = 2)

enter image description here

#### The plot here is made from the data in the reprex in OP, but with h = 0.5
library(MASS)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1]  # standard normal (mu=0, sd=1)
y <- data[, 2]  # standard normal (mu=0, sd=1)
test.df <- data.frame(x = x, y = y)

## MAKE BLUE LINE
p <- ggplot(test.df, aes(x, y)) +
  geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5)  ## NOTE h = 0.5
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>% 
  mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])

## MAKE RED LINE
## h = 0.5
## Given the highly irregular shape of the contours, I will use only the largest contour line (0.95) for draing the line.
## Thus, average = 1. See function below for details.
ln = long.diam("x", "y", test.df, h = 0.5, average = 1) ## NOTE h = 0.5

## PLOT
ggplot(test.df, aes(x, y)) +
  geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5) + ## NOTE h = 0.5
  geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
                    xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 2) +
  geom_abline(intercept = ln[1], slope = ln[2], color = 'red', lwd = 2) +
  coord_equal()

enter image description here Finally, I came up with the following function to deal with all this. Sorry for the lack of comments/clarity

#### This will return the intercept and slope of the longest diameter (semi-major axis).
####If Average = TRUE, it will average the int and slope across different density contours.
long.diam = function(x, y, df, probs = c(0.95, 0.5, 0.1), average = T, h = 2) {
  fun.df = data.frame(cbind(df[,x], df[,y]))
  colnames(fun.df) = c("x", "y")
  dens = kde2d(fun.df$x, fun.df$y, n = 200, h = h)
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy 
  levels <- sapply(probs, function(x) { 
    approx(c1, sz, xout = 1 - x)$y
  })
  names(levels) = paste0("L", str_sub(formatC(probs, 2, format = 'f'), -2))
  #plot(fun.df$x,fun.df$y, asp = 1)
  #contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
  #contour(dens, add = T, col = 'red', lwd = 2)
  #abline(lm(fun.df$y~fun.df$x))

  ls <- contourLines(dens, levels = levels)
  names(ls) = names(levels)

  lines.info = list()
  for (i in 1:length(ls)) {
    d = cbind(ls[[i]]$x, ls[[i]]$y)
    exy = predict(ellipsoidhull(d))## the ellipsoid boundary
    colnames(exy) = c("x", "y")
    me = colMeans((exy))            ## center of the ellipse
    dist2center = sqrt(rowSums((t(t(exy)-me))^2))
    #plot(exy,type='l',asp=1)
    #points(d,col='blue')
    #lines(exy[order(dist2center)[1:2],])
    #lines(exy[rev(order(dist2center))[1:2],])
    max.dist = data.frame(exy[rev(order(dist2center))[1:2],])
    line.fit = lm(max.dist$y ~ max.dist$x)
    lines.info[[i]] = c(as.numeric(line.fit$coefficients[1]), as.numeric(line.fit$coefficients[2]))
  }
  names(lines.info) = names(ls)

  #plot(fun.df$x,fun.df$y, asp = 1)
  #contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
  #abline(lines.info[[1]], col = 'red', lwd = 2)
  #abline(lines.info[[2]], col = 'blue', lwd = 2)
  #abline(lines.info[[3]], col = 'green', lwd = 2)
  #abline(apply(simplify2array(lines.info), 1, mean), col = 'black', lwd = 4)
  if (isTRUE(average)) {
    apply(simplify2array(lines.info), 1, mean)
  } else {
    lines.info[[average]]
  }
}

Finally, here's the final implementation of the different answers:

library(MASS)
set.seed(123)
samples = 10000
r = 0.9
data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x = data[, 1]  # standard normal (mu=0, sd=1)
y = data[, 2]  # standard normal (mu=0, sd=1)
#plot(x, y)
test.df = data.frame(x = x, y = y)

#### Find furthest two points of contour
## BLUE
p <- ggplot(test.df, aes(x, y)) +
  geom_density2d(color = 'red', lwd = 2, contour = T, h = 2) 
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>% 
  mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])

#### Find the average intercept and slope of 3 contour lines (0.95, 0.5, 0.1), as in my long.diam function above.
## RED
ln = long.diam("x", "y", test.df)

#### Plot everything. Black line is GLM
ggplot(test.df, aes(x, y)) +
  geom_point(color = 'grey') +
  geom_density2d(color = 'red', lwd = 1, contour = T, h = 2) + 
  geom_smooth(method = "glm", se = F, lwd = 1, color = 'black') +
  geom_abline(intercept = ln[1], slope = ln[2], col = 'red', lwd = 1) +
  geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
                    xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 1) +
  coord_equal()

Final image

user3579613
  • 113
  • 11
  • that‘s a nice function. indeed your question was not making this point clear enough - maybe you should have provided better sample data with less smooth distribution in order to make the point clear. maybe replace your original dataframe with the new sample data from your answer – tjebo Jul 11 '18 at 09:15
  • Can you specify which part of the question was unclear? As I understood, I asked for help drawing a line through a contour line (ellipse). I even suggested that the best approach would be to find the longest diameter. I think your answer is actually correct, it's just slightly imperfect in specific cases, so I tried to find a 'more perfect' approach. I'd appreciate your comments on this. – user3579613 Jul 11 '18 at 14:59
  • I just removed the first parts of my answer, as it seemed irrelevant to the question at hand. I just left the actual function. – user3579613 Jul 11 '18 at 15:05
  • Haha OK. I was pretty happy with my function and was surprised to get a downvote – user3579613 Jul 11 '18 at 15:05
  • Thanks! I don't think the first part was entirely useless. - The picture with the odd contours was actually very helpful to understand what you *actually* meant... – tjebo Jul 11 '18 at 15:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174820/discussion-between-user3579613-and-tjebo). – user3579613 Jul 11 '18 at 15:09
  • 1
    I added back the first part, as well as added a final figure comparing all the different lines. GLM, farthest two points, and longest diameter. It's a long answer but I kept everything in there since it might be useful. @Tjebo, thanks for your help and comments! – user3579613 Jul 11 '18 at 15:33