I am trying to display my results using a violin plot and a box plot at the same time.

I am using cell count to display the number of immune cells in different cancer samples/groups. When I plot the expression for 4 samples, everything works. When I add another sample (GTEx_M2), the violin plots for all other 4 samples disappear and I end up with only the box plots.

Any suggestions? Thanks in advance!

library(ggplot2)
library(ggpubr)
Cibersort7 = structure(list(
  Hot_M1 = c(0.0214400757119873, 0.170557805230298, 0.0804456569076382, 
             0.0893978598771954, 0.134477669028274, 0, 0.0525708788146097, 
             0.0511711964723951, 0.126904881120795, 0.0485101553521798, 
             0.170894800822398, 0.106555021195299, 0.0970104286070479, 
             0.115825265978309, 0.0427923320117795, 0.0733825856784013, 
             0.0111265771852828, 0.0657019859547462, 0.11656416302191,
             0.172002238486688, 0.0154591596631105, 0.0350445248592811, 
             0.0795539781894198, 0.0781276090630857, 0.0087982313041526, 
             0.0289274652853823, 0.0712661645666698, 0.0435482190581647, 
             0.0455556872660798, 0.0871522448556361), 
  Cold_M1 = c(0.0346024087291239, 0.0201947741817111, 0.0306194109725081, 
              0.0277445612030966, 0.00905915199266666, 0.00939058305405205, 
              0.0146535473252646, 0.0159980760737253, 0.147670469457772, 
              0.0426119074182886, 0.0219251208462312, 0.0128996237306264, 
              0.0094816829459359, 0.0219336027293415, 0.0438220246067735, 
              0.00950926112282649, 0.0838386603270565, 0.0486661009213444, 
              0.00651564872414969, 0.00110323590537234, 0.0807125087307139, 0, 
              0.037709808301658, 0, 0.0898041410439557, 0.0417739517920607, 0, 
              0.0202168551193018, 0.00176008746063679, 0.0161337603014608), 
  Hotnorm_M1 = c(0.00622155478760928, 0.00864956989565159, 0.0245812979257332, 
                 0.0339687958970202, 8e-04, 0, 0.0582086801600888, 0, 
                 0.03481918582501, 0.021338008027511, 0.0157360408231509, 
                 0.00489068636912568, 0.0281166183638247, 0.0162726467268935, 
                 0.0415769266772567, 0, 0.00344830695596762, 0.00196737745405557, 
                 0.0075141479562764, 0.0232464687737552, 0, 0, 0.0289423690350636, 
                 0.0218584208695064, 0.0255945495324721, 4e-04, 0.0221942067802419, 
                 0.00476738514342175, 0.00722699142988291, 0.00974645683928458), 
  Coldnorm_M1 = c(0.0280536098964266, 0.0261826834038114, 0.0150413750071331, 0, 
                  0.0199730743908202, 0.0115748800373456, 0.0275674859254823, 
                  0.0168847795974374, 0.0140281070945953, 0.00907861159279308, 
                  0, 0, 0, 0.0453414461512909, 0, 0.00730963773612433, 
                  0.0236424416792874, 0.0866914356225127, 0.0246339344582405, 
                  0.00881531992455549, 0.0140744199322424, 0, 0, 0, 
                  0.0319211626770028, 0.00155291355277603, 0.00295913497381517, 
                  0.00738775271575955, 0.0179786878323852, 0.00442919920031897), 
  GTEx_M1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
              0, 0, 0, 0.00551740159760184, 0, 0, 0, 0, 0)), 
  row.names = c(NA, -30L), 
  class = c("tbl_df", "tbl", "data.frame"))

This is a small part of my data that still shows the same issue I see.

y_axis  = list(na.omit(Cibersort7$Hot_M1), 
               na.omit(Cibersort7$Cold_M1), 
               na.omit(Cibersort7$Hotnorm_M1), 
               na.omit(Cibersort7$Coldnorm_M1), 
               na.omit(Cibersort7$GTEx_M1))
# Pre-allocate one empty list slot per sample (list(5) would create a single element holding 5)
groupname = groupexpression = data = vector("list", 5)

for (i in 1:5){
  groupname[[i]] = as.factor(colnames(Cibersort7[, i]))
  groupexpression[[i]] = y_axis[[i]]
  data[[i]] = data.frame("Sample" = groupname[[i]], 
                         "Expression" = groupexpression[[i]])
}
dataframe = do.call(rbind, data)
dataframe$Sample = as.factor(dataframe$Sample)

my_comparisons = list(c("Hot_M1", "Cold_M1"),
                      c("Hot_M1", "Hotnorm_M1"), 
                      c("Hot_M1", "GTEx_M1"),
                      c("Cold_M1", "Coldnorm_M1"),
                      c("Cold_M1", "GTEx_M1"))

violinPlot = ggplot(dataframe, 
                    aes(x =Sample, y = Expression, fill = Sample)) + 
  geom_violin(trim = FALSE) + 
  geom_boxplot(width=0.1, fill="white") + 
  labs(title ="Distribution of M2 Macrophages", 
       x = "Tissue Samples", y = "Cibersort Count") + 
  theme_classic()

violinPlot

Here is what my violin plots look like:

plot 1

Here is what they looked like before adding the GTEx data:

plot 2

And here is the GTEx violin plot when displayed alone:

plot 3

I understand that my GTEx data is almost all zeros, but why do the violin plots of the other samples disappear?

Z.Lin
Miso

  • Welcome to Stack Overflow! Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Mar 18 '19 at 01:28
  • Can you edit your question and put the data there? Also make sure that your `ggplot` code works with that sample dataset. – Tung Apr 04 '19 at 02:00
  • I just did @Tung – Miso Apr 04 '19 at 02:23
  • Actually based on your sample data (which has only 2 valid values per group), no violin plot is plotted for any group at all... – Z.Lin Apr 05 '19 at 08:24
  • I just added a more representative data set. Thanks for bearing with me! I also added more plots to illustrate the issue better – Miso Apr 05 '19 at 20:09

1 Answer

geom_violin has an argument named scale, which takes on the default value "area". From ?geom_violin:

if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.

Since GTEx's Expression values are concentrated at 0, its density peaks sharply at that value. We can see it more obviously in a normal density plot, with each sample's line overlaid atop one another:

ggplot(dataframe,
       aes(x = Expression, color = Sample)) +
  geom_density() +
  theme_classic()

density plot

With the default scale = "area" argument, every violin is given the same area, so the widths are set relative to the tallest density peak in the plot. Including GTEx in the data therefore makes the violins for all other samples a lot skinnier, and they end up almost completely covered by the boxplots. You'd still be able to see them if you comment out the boxplot layer.
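
To get a feel for the size of the effect, here is a minimal base R sketch. The two vectors are hypothetical stand-ins (not the asker's data), and it assumes that the violin widths under scale = "area" are proportional to each group's density divided by the largest density in the plot, with the default "nrd0" bandwidth rule, which is my understanding of how geom_violin computes them:

```r
# Hypothetical stand-ins for two samples of 30 values each (not the real data)
spread <- seq(0, 0.17, length.out = 30)  # values spread out, similar in range to Hot_M1
spike  <- c(rep(0, 29), 0.0055)          # almost all zeros, similar to GTEx_M1

# Kernel density estimates using the default bandwidth rule (bw = "nrd0")
peak_spread <- max(density(spread)$y)
peak_spike  <- max(density(spike)$y)

# Assumed scaling: the widest point of each violin relative to the tallest peak overall
relative_width <- peak_spread / peak_spike

print(c(peak_spread = peak_spread,
        peak_spike = peak_spike,
        relative_width = relative_width))
```

The spread-out sample's widest point comes out as a small fraction of the spiked sample's, which is why those violins end up narrower than a boxplot drawn with width = 0.1.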

You can set scale = "width" instead if you want comparable visibility between each violin. If you choose this option, you may also want to point it out to your target audience: scale = "area" tends to be more common, and readers may be confused when some violins appear to have a clearly larger area than others.

ggplot(dataframe, 
       aes(x = Sample, y = Expression, fill = Sample)) + 
  geom_violin(trim = FALSE, scale = "width") +
  geom_boxplot(width=0.1, fill="white") +
  labs(title ="Distribution of M2 Macrophages", 
       x = "Tissue Samples", y = "Cibersort Count") + 
  theme_classic()

violin plot

p.s. You can simplify your data processing steps, which are (from what I can tell) essentially a conversion from wide to long format. The usual way to do this is via melt (from reshape2 package) or gather (from tidyr package). Here's a possible implementation:

library(dplyr)
library(tidyr)

df2 <- Cibersort7 %>%
  gather(Sample, Expression) %>%
  mutate(Sample = factor(Sample, levels = colnames(Cibersort7)))

> all.equal(dataframe, as.data.frame(df2))
[1] TRUE
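
In tidyr 1.0.0 and later, gather is superseded by pivot_longer. A minimal sketch of the same reshaping, using a small hypothetical data frame named wide in place of Cibersort7:

```r
library(tidyr)

# Hypothetical wide data: one column per sample (stand-in for Cibersort7)
wide <- data.frame(
  Hot_M1  = c(0.02, 0.17, 0.08),
  GTEx_M1 = c(0, 0, 0.0055)
)

# Wide to long: column names go to Sample, values go to Expression
long <- pivot_longer(wide,
                     cols = everything(),
                     names_to = "Sample",
                     values_to = "Expression")

# Keep the original column order as the factor level order
long$Sample <- factor(long$Sample, levels = colnames(wide))

# pivot_longer emits rows row by row; sort by Sample to group each sample's values
long <- long[order(long$Sample), ]
```

Unlike gather, pivot_longer interleaves the samples row by row, so the sort in the last step is needed only if you want the same row order as the rbind-based dataframe above; ggplot does not depend on it.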

p.p.s. If there are multiple people commenting in your thread and you don't @ anyone in your reply, no one is going to get a notification about it, which is rather a waste if you've gone to all the trouble of improving your question. See here for an explanation of how the system works.

Z.Lin