
I have a melted data frame `df` with four columns: sample names, Group, Genes, and Expression (logCPM).

head(df)

sample names    Group   Genes   Expression (logCPM)
Sample1        GroupA   Gene1   3.45
Sample2        GroupA   Gene1   2.34
Sample3        GroupA   Gene1   0.5667
Sample4        GroupA   Gene1   1.98
Sample5        GroupA   Gene1   0.45
Sample6        GroupB   Gene1   4.566
Sample7        GroupB   Gene1   0.5667

I'm trying to make a violin plot combined with a box plot, using the following code:

positions <- c("GroupA", "GroupB")
e <- ggplot(df, aes(x = Genes, y = `Expression (logCPM)`))  # backticks are needed for a non-syntactic column name
e2 <-  e + geom_violin(
  aes(color = Group), trim = FALSE,
  position = position_dodge(0.9), draw_quantiles=c(0.5)) +
  geom_boxplot(
    aes(color = Group), width = 0.01,
    position = position_dodge(0.9)) +
  scale_color_manual(legend_title, values = c("GroupA"="#FC4E07", "GroupB"="#00AFBB")) +
  theme_bw(base_size = 14) + xlab("") + ylab("Expression (logCPM)") +
  theme(axis.text=element_text(size=15, face = "bold", color = "black"),
        axis.title=element_text(size=15, face = "bold", color = "black"),
        strip.text = element_text(size=15, face = "bold", color = "black"),
        axis.text.x = element_text(angle = 0),
        legend.text=element_text(size=12, face = "bold", color = "black"),
        legend.title=element_text(size=15,face = "bold", color = "black"))
e2


I am trying to create violin plots with a boxplot inside each violin, but the result doesn't look good: instead of a violin, each one looks like a line. Is there anything I have to correct to fix the alignment?

The data I'm using is huge.

beginner
  • In your `position_dodge()` calls, does reducing the width help with what you intend? Because the x aesthetic is a factor, ggplot2 by default puts the center of each vertical set at the integers 1, 2, ..., N. Specifying a dodge width of 0.9 may then leave each of the two dodged entities a space of only 0.05. I can't say that is a guarantee; it is more of a guess about the inner workings of the ggplot2 package. – statstew May 18 '20 at 02:25
  • Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung May 18 '20 at 02:37

2 Answers


I had to copy your first three samples into GroupB to make up for the small sample size. Is this what you're looking for?

library(tidyverse)
df <- tribble(~"sample names",~Group,~Genes,~"Expression (logCPM)",
              "Sample1","GroupA","Gene1",3.45,
              "Sample2","GroupA","Gene1",2.34,
              "Sample3","GroupA","Gene1",0.5667,
              "Sample4","GroupA","Gene1",1.98,
              "Sample5","GroupA","Gene1",0.45,
              "Sample6","GroupB","Gene1",4.566,
              "Sample7","GroupB","Gene1",0.5667,
              "Sample8","GroupB","Gene1",3.45, # extra, copied from Sample 1
              "Sample9","GroupB","Gene1",2.34, # extra, copied from Sample 2
              "Sample10","GroupB","Gene1",0.5667) # extra, copied from Sample 3

# I prefer to store all the aes() in the first ggplot() layer so that the
# remaining layers can just be about customising the plot
ggplot(df, aes(x = Genes, y = `Expression (logCPM)`, group = Group, fill = Group)) +
  geom_violin(trim = FALSE, alpha = 0.5, draw_quantiles = c(0.5), position = position_dodge(1)) +
  geom_boxplot(width = 0.1, position = position_dodge(1)) +
  theme_bw() # + other theme settings


Anurag N. Sharma
  • I know this. But the problem is that if I plot each `Gene` separately, the violin plot looks good. If I try to make a single plot for all genes, as in the question, I don't see the violin at all. – beginner May 18 '20 at 07:19
  • If you provide a larger portion of your dataset, I could help you out. – Anurag N. Sharma May 18 '20 at 07:56
  • This is the problem: the data is super big, so I couldn't post it here. – beginner May 18 '20 at 09:43
  • If you're planning on plotting more than 10 or so violin plots in the same figure, they can end up looking too thin irrespective of any adjustments you make. What exactly are you trying to depict by plotting this many counts-per-million values for the genes? How many genes? – Anurag N. Sharma May 18 '20 at 10:01
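
One possible workaround for the many-genes case, not taken from the thread itself, is to give each gene its own panel with `facet_wrap()`. As far as I understand, the violin scaling is then computed per panel, which would match the observation that each gene looks fine when plotted on its own. The sketch below runs on simulated stand-in data; `df_sim`, `logCPM`, `Replicate`, and the gene and sample counts are hypothetical names and values, not the asker's data.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical): 12 genes, 2 groups, 30 values per group
set.seed(42)
n_genes <- 12
n_per_group <- 30
df_sim <- expand.grid(
  Replicate = seq_len(n_per_group),
  Group = c("GroupA", "GroupB"),
  Genes = paste0("Gene", seq_len(n_genes)),
  stringsAsFactors = FALSE
)
df_sim$logCPM <- rnorm(nrow(df_sim), mean = 2, sd = 1)

# One panel per gene; the two groups sit side by side inside each panel
p_facet <- ggplot(df_sim, aes(x = Group, y = logCPM, fill = Group)) +
  geom_violin(trim = FALSE, alpha = 0.5) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  facet_wrap(~ Genes, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none")

p_facet
```

Whether this stays readable depends on how many genes there are; with hundreds of genes, a different summary is probably needed.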

Adding `scale = "width"` helped for me:

geom_violin(aes(x=x, y=y, group=x),
          scale = "width")
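
The fragment above uses placeholder `x` and `y` columns. A fuller sketch on simulated data might look like the following; `df_demo`, `logCPM`, and `Replicate` are hypothetical names, and the tightly clustered `Gene1` is an assumption made only to reproduce the effect. My understanding is that with the default `scale = "area"`, widths are scaled relative to the tallest density peak in the panel, so one tightly clustered gene can make every other violin look like a line, whereas `scale = "width"` gives every violin the same maximum width.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical): 8 genes, 2 groups, 50 values per group
set.seed(1)
df_demo <- expand.grid(
  Replicate = 1:50,
  Group = c("GroupA", "GroupB"),
  Genes = paste0("Gene", 1:8),
  stringsAsFactors = FALSE
)
# Most genes get a broad spread; Gene1 gets a very narrow one,
# which produces a tall density peak
df_demo$logCPM <- rnorm(
  nrow(df_demo),
  mean = 2,
  sd = ifelse(df_demo$Genes == "Gene1", 0.02, 1)
)

p_width <- ggplot(df_demo, aes(x = Genes, y = logCPM, fill = Group)) +
  geom_violin(scale = "width", trim = FALSE, alpha = 0.5,
              position = position_dodge(0.9)) +
  geom_boxplot(width = 0.1, outlier.shape = NA,
               position = position_dodge(0.9)) +
  theme_bw()

p_width
```

Dropping `scale = "width"` from the call should show the contrast on the same data.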
lizaveta