
I have written a loop that creates a list of scatter plots of difference (y axis) vs sample (x axis). For these plots, outliers are defined as samples where the difference between two counts for the same sample is larger than 10%. I have coloured the outliers red and would like the non-outliers to be green, but for some reason they come out as green and grey (specifically, I think the points where difference = 0 are grey). Is there an error somewhere in my code that I am missing?

Below is my code to create the list of scatter plots

myplots <- list()

for (x in c(1:length(stage2))) {
  message(stage2[x])
  myplots[[stage2[x]]] <- local({
    x <- x
    perc_diff <- (abs(df[, paste0("D_", stage2[x])])/
                  (df[, paste0("M_", stage2[x])] + 
                   df[ , paste0("H_", stage2[x])])/2)*100
    sct <- ggplot(df, aes(x=sample, y= df[, paste0("D_", stage2[x])])) + 
      geom_point(df, size= 2.2, 
           mapping = aes(colour = ifelse(perc_diff >= 10 , "outlier", 
                         "non-outlier"))) +
      labs(y = "", x= "") + 
      theme(axis.text.x = element_blank(),
             axis.ticks.x=element_blank(),
            legend.position = "none") +
      scale_color_manual(values=c("palegreen3", "tomato3"))
    print(sct)
  })
}

I then title them and use grid arrange to arrange them:

for (i in c(1:length(D_stage))){
  myplots[[i]] <- myplots[[i]] + ggtitle((stage2[i])) + 
                  theme(plot.title = element_text(size = 25))
}
gridExtra::grid.arrange(grobs = myplots, ncol = 2, 
            left = "count difference", bottom =  "sample")

This is what I get. I need the grey points to be green as well.

kjetil b halvorsen
Juju
    The grey points are NAs, i.e. values where perc_diff is NA. Personally I would check what the reason for that is. But a quick fix would be to add `na.value="palegreen3"` to scale_color_manual. Also, note that mapping "external" vectors to aesthetics is not recommended. Instead use `y = .data[[paste0("D_", stage2[x])]]` and make perc_diff a column of your dataset using `df$perc_diff <- ...`. For more help please provide [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data. – stefan Mar 16 '23 at 11:33

1 Answer

First off, commenting out legend.position = "none" in the theme() call makes ggplot show the legend, so you can see which category each colour represents. That should give you some idea of what is going on.

But as indicated in the first comment, the grey points are NA values. If I take your perc_diff calculation and simplify it:

(abs(a) / (b + c) / 2) * 100

you can see that a missing value in b or c makes the whole expression NA. ifelse() then returns NA for that row instead of "outlier" or "non-outlier", and ggplot draws NA as a third colour category.
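To see the propagation concretely, here is a minimal base R sketch. The three short vectors are made up for illustration (they are not taken from the question's data), and `cc` stands in for c to avoid reusing the name of the c() function.

```r
# Made-up values: the second element of b is missing.
a  <- c(0, 2, 6)
b  <- c(10, NA, 5)
cc <- c(3, 7, 0)

# Same shape as the perc_diff calculation in the question.
perc_diff <- (abs(a) / (b + cc) / 2) * 100

# The comparison NA >= 10 is NA, so ifelse() returns NA for that element
# rather than either label.
status <- ifelse(perc_diff >= 10, "outlier", "non-outlier")

print(perc_diff)
print(status)
```

The second element is NA in both vectors, and that NA is what ends up grey in the plot. Separately, note that abs(a) / (b + c) / 2 divides by twice the sum of the counts. If the intent was the difference relative to the mean of the two counts (an assumption about the intent, not something stated in the question), the denominator would need its own parentheses: abs(a) / ((b + c) / 2).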

Here is an example with some dummy data for a, b, c and sample, without the loop:

library(tidyverse)

df <- data.frame("a" = c(0,2,2,3,0,6,6,8,10,0),
                 "b" = c(10,3,25,45,NA,5,9,30,12,4),
                 "c" = c(3,7,4,5,18,0,3,19,3,NA),
                 "sample" = c(1,2,3,4,5,6,7,8,9,10))

df$perc_diff <- (abs(df$a)/(df$b + df$c)/2)*100

ggplot(df, aes(x = sample, y = a)) + 
  geom_point(size = 2.2,
             mapping = aes(colour = ifelse(perc_diff >= 10, "outlier", "non-outlier"))) +
  labs(y = "", x= "") + 
  theme(axis.text.x = element_blank(),
        axis.ticks.x=element_blank()) +
  scale_color_manual(values=c("palegreen3", "tomato3"))

This shows you that NA is grey in the legend. In addition, you will notice that you probably have missing data in b or c specifically in conjunction with a being 0, since all your grey points are where the y value is 0.

So check your data (are you okay with having missing data?), or set a colour for NA values, as indicated in the comment. I also agree with the rest of the comment: I would even make the "outlier" and "non-outlier" information a column.
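A minimal sketch of both suggestions, reusing the dummy data from above. The column name `status` is chosen for this example only. The colours are given as a named vector so they do not depend on the order of the categories, and `na.value` sets the colour for the NA rows. Whether NA rows should really be shown as non-outliers is a decision about your data that the code cannot settle.

```r
library(ggplot2)

# Same dummy data as in the example above.
df <- data.frame(a = c(0, 2, 2, 3, 0, 6, 6, 8, 10, 0),
                 b = c(10, 3, 25, 45, NA, 5, 9, 30, 12, 4),
                 c = c(3, 7, 4, 5, 18, 0, 3, 19, 3, NA),
                 sample = 1:10)

# Store the classification in the data frame instead of computing it inside aes().
df$perc_diff <- (abs(df$a) / (df$b + df$c) / 2) * 100
df$status <- ifelse(df$perc_diff >= 10, "outlier", "non-outlier")

p <- ggplot(df, aes(x = sample, y = a, colour = status)) +
  geom_point(size = 2.2) +
  labs(x = "", y = "") +
  scale_colour_manual(
    # Named values: each label is tied to its colour explicitly.
    values = c("non-outlier" = "palegreen3", "outlier" = "tomato3"),
    # Rows where status is NA are drawn in green instead of the default grey.
    na.value = "palegreen3"
  )
print(p)
```

Inside a loop like the one in the question, the same idea would apply per stage: compute the column first, then map it with `colour = status`.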

Geraldine