
I have a table with normalized scores across samples. I have calculated the z scores and need to create the plot shown below:

[image of the desired plot]

My table looks like:

Gene_name       E_2401_ctrl_1  E_2401_ctrl_2  E_2401_ctrl_3  E_2401_drt_1  E_2401_drt_2
LOC_Os01g01312    1.601736731    0.310548148    1.238589942  -0.899941148  -0.980640435
LOC_Os01g01360   -0.850254449   -0.420792594    0.083358279    0.86759297   0.102030534
LOC_Os01g01390     0.40382069   -0.377555928   -0.009849284  -0.285629267   0.219967368
LOC_Os01g01610   -1.102507436    -0.90329537   -0.458899223   1.042853272   0.904937227
LOC_Os01g01620   -0.806239145   -1.190898502   -0.229250108   0.812535653   1.004865332

I have almost 3000 genes and 50 samples, so plotting in Excel is not an option.

KoenV
Bitsy
  • OK, so start reading some ggplot2 tutorials. – Roland Sep 08 '17 at 07:22
  • A [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), once you understand your problem better, is the best way to get a good response here. – Kevin Arseneau Sep 08 '17 at 07:30

1 Answer


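Create a data frame from your data. Note that it is transposed relative to your table: each row is a sample and each column is a gene, so the column called Gene_name actually holds the sample names: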

df <- data.frame(Gene_name = c("E_2401_ctrl_1", "E_2401_ctrl_2", "E_2401_ctrl_3", "E_2401_drt_1", "E_2401_drt_2"),
              LOC_Os01g01312 = c(1.601736731, 0.310548148, 1.238589942, -0.899941148, -0.980640435),
              LOC_Os01g01360 = c(-0.850254449, -0.420792594, 0.083358279, 0.86759297, 0.102030534),
              LOC_Os01g01390 = c(0.40382069 , -0.377555928, -0.009849284, -0.285629267, 0.219967368),
              LOC_Os01g01610 = c(-1.102507436, -0.90329537, -0.458899223, 1.042853272, 0.904937227),
              LOC_Os01g01620 = c(-0.806239145, -1.190898502, -0.229250108, 0.812535653, 1.004865332))

library(ggplot2)
library(reshape2)

For ggplot2, reshaping the data to long format is a good idea:

df_melt <- reshape2::melt(df, id.vars = "Gene_name")

Check how the data look now:

head(df_melt, 10)

After melting, the gene names are in one column (variable) and the corresponding z-scores are in another (value).
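As an aside, the same long format can also be produced with tidyr::pivot_longer instead of reshape2::melt. This is only a minimal sketch, assuming tidyr 1.0.0 or later is installed. The data frame is a two-gene excerpt of the one above, and the names df_long, variable and value are chosen here simply to mirror the melt output:

```r
library(tidyr)

# Two-gene excerpt of the data frame above (rows are samples, columns are genes)
df <- data.frame(Gene_name = c("E_2401_ctrl_1", "E_2401_ctrl_2", "E_2401_ctrl_3",
                               "E_2401_drt_1", "E_2401_drt_2"),
                 LOC_Os01g01312 = c(1.601736731, 0.310548148, 1.238589942, -0.899941148, -0.980640435),
                 LOC_Os01g01360 = c(-0.850254449, -0.420792594, 0.083358279, 0.86759297, 0.102030534))

# Every column except Gene_name is stacked into a name/value pair
df_long <- pivot_longer(df, cols = -Gene_name,
                        names_to = "variable", values_to = "value")

head(df_long)
```

Unlike melt, pivot_longer returns the variable column as character rather than as a factor, so ggplot2 will order the x-axis alphabetically unless you convert it to a factor with the levels you want.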

ggplot(data = df_melt)+
      geom_line(aes(x = variable, y = value, group = Gene_name))+
      theme(axis.text.x = element_text(angle = 45, hjust = 1))+
      xlab("gene")+
      ylab("z-score")

How this reads:

ggplot(data = df_melt)

This specifies the data frame from which the plot is drawn.

geom_line(aes(x = variable, y = value, group = Gene_name))

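geom_line is used because you want lines connecting the values. In ggplot2, every column that is mapped to a visual property (x, y, group, color) goes inside aes(); here group = Gene_name draws one line per value of Gene_name.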

theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("gene")+
  ylab("z-score")

The rest is purely cosmetic: rotated x-axis labels and axis titles.
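If you would rather keep the table in the layout shown in the question (one row per gene, one column per sample), the same approach gives one line per gene with the samples on the x-axis. A minimal sketch, assuming the table has already been read into a data frame; the names tab, tab_long, sample and zscore are made up for illustration, and only the first two genes are included to keep it short:

```r
library(ggplot2)
library(reshape2)

# First two genes of the question's table: one row per gene, one column per sample
tab <- data.frame(Gene_name = c("LOC_Os01g01312", "LOC_Os01g01360"),
                  E_2401_ctrl_1 = c(1.601736731, -0.850254449),
                  E_2401_ctrl_2 = c(0.310548148, -0.420792594),
                  E_2401_ctrl_3 = c(1.238589942, 0.083358279),
                  E_2401_drt_1 = c(-0.899941148, 0.86759297),
                  E_2401_drt_2 = c(-0.980640435, 0.102030534))

# Long format: one row per gene/sample pair
tab_long <- melt(tab, id.vars = "Gene_name",
                 variable.name = "sample", value.name = "zscore")

# One line per gene, drawn across the samples
p <- ggplot(data = tab_long)+
  geom_line(aes(x = sample, y = zscore, group = Gene_name))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("sample")+
  ylab("z-score")
print(p)
```

Because melt stores the former column names as a factor in their original order, the samples appear on the x-axis in the same order as in the table.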

If you would like faceted plots, add a variable to facet by:

df_melt <- data.frame(rbind(df_melt, df_melt), 
    letters=rep(c("A", "B"), each = nrow(df_melt)))

Here I simply stack the data frame on top of itself,

rbind(df_melt, df_melt)

and label the rows from the first copy with "A" and those from the second copy with "B".

letters=rep(c("A", "B"), each = nrow(df_melt))
df_melt

Now you can facet by the "letters" variable:

ggplot(data = df_melt)+
  geom_line(aes(x = variable, y = value, group = Gene_name))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("gene")+
  ylab("z-score")+
  facet_wrap(~letters, ncol = 1)
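With roughly 3000 genes and 50 samples, as mentioned in the question, one line per gene will overlap heavily. Semi-transparent lines are one common way to keep the overall pattern visible. The sketch below uses randomly generated placeholder values (not real data) with the dimensions from the question; the names sim, gene and sample are made up, and alpha = 0.05 is just an arbitrary starting value to tune:

```r
library(ggplot2)

set.seed(1)
n_genes <- 3000
n_samples <- 50
sample_names <- paste0("sample_", seq_len(n_samples))

# Placeholder long-format data: random numbers standing in for real z-scores
sim <- data.frame(gene = rep(paste0("gene_", seq_len(n_genes)), each = n_samples),
                  sample = factor(rep(sample_names, times = n_genes), levels = sample_names),
                  value = rnorm(n_genes * n_samples))

# alpha makes each line nearly transparent, so dense regions appear darker
p <- ggplot(data = sim)+
  geom_line(aes(x = sample, y = value, group = gene), alpha = 0.05)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("sample")+
  ylab("z-score")
print(p)
```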

EDIT: lines can be colored by cluster label by adding color = <column> to the aes() call inside the geom you want colored. I will start from the beginning:

df <- data.frame(Gene_name = c("E_2401_ctrl_1", "E_2401_ctrl_2", "E_2401_ctrl_3", "E_2401_drt_1", "E_2401_drt_2"),
                 LOC_Os01g01312 = c(1.601736731, 0.310548148, 1.238589942, -0.899941148, -0.980640435),
                 LOC_Os01g01360 = c(-0.850254449, -0.420792594, 0.083358279, 0.86759297, 0.102030534),
                 LOC_Os01g01390 = c(0.40382069 , -0.377555928, -0.009849284, -0.285629267, 0.219967368),
                 LOC_Os01g01610 = c(-1.102507436, -0.90329537, -0.458899223, 1.042853272, 0.904937227),
                 LOC_Os01g01620 = c(-0.806239145, -1.190898502, -0.229250108, 0.812535653, 1.004865332))

df_melt <- reshape2::melt(df, id.vars = "Gene_name")

# ifelse() creates a column called "lett" that is "A" for "ctrl" samples and "B" otherwise
# (the Gene_name column holds the sample names here)

df_melt$lett <- ifelse(grepl("ctrl", df_melt$Gene_name), "A", "B")


ggplot(data = df_melt)+
  geom_line(aes(x = variable, y = value, group = Gene_name, color = lett))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("gene")+
  ylab("z-score")+
  scale_color_manual(values=c("A" = "red", "B" = "blue"))

More on controlling the colors:

http://ggplot2.tidyverse.org/reference/scale_manual.html
http://ggplot2.tidyverse.org/reference/scale_brewer.html
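If the cluster labels sit in their own column next to the gene names, one row per gene, that column can be carried through the reshaping as a second id variable and then mapped to color. A sketch under that assumption, using the question's layout (one line per gene). The cluster labels "C1" and "C2" are invented purely for illustration, as are the names tab, tab_long, sample, zscore and cluster:

```r
library(ggplot2)
library(reshape2)

# Three genes and three samples from the question, plus a HYPOTHETICAL cluster column
tab <- data.frame(Gene_name = c("LOC_Os01g01312", "LOC_Os01g01360", "LOC_Os01g01610"),
                  E_2401_ctrl_1 = c(1.601736731, -0.850254449, -1.102507436),
                  E_2401_ctrl_2 = c(0.310548148, -0.420792594, -0.90329537),
                  E_2401_drt_1 = c(-0.899941148, 0.86759297, 1.042853272),
                  cluster = c("C1", "C2", "C2"))  # made-up labels, not from the post

# Keep both the gene name and its cluster label as id variables
tab_long <- melt(tab, id.vars = c("Gene_name", "cluster"),
                 variable.name = "sample", value.name = "zscore")

# One line per gene, colored by cluster, one panel per cluster
p <- ggplot(data = tab_long)+
  geom_line(aes(x = sample, y = zscore, group = Gene_name, color = cluster))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  xlab("sample")+
  ylab("z-score")+
  facet_wrap(~cluster, ncol = 1)
print(p)
```

The facet_wrap line is optional; drop it to draw all clusters in a single panel.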

missuse
  • Thank you so much for the help. There are times when my genes are part of different clusters. I have the cluster labels for them. Is it possible to color the lines based on the cluster labels ? Basically, beside the gene_name columns, I will have a column with the corresponding cluster labels. – Bitsy Sep 09 '17 at 11:57
  • Glad to help. Edited the post, see above. – missuse Sep 09 '17 at 13:08