Why ggplot2 geom_hlines plots more than intended?

Question

Here is a sample of the dataframe I am working with.

> head(tbl[,c('logFC', 'CI_L', 'CI_R', "adj_P_Value","gene",'Group1','Group2', 'Study_ID')])
        logFC        CI_L       CI_R adj_P_Value  gene             Group1                   Group2 Study_ID
1 -0.09017596 -0.43955752 0.25920561           1 CD244               Male                   Female  GSE2461
2  0.08704844 -0.26134341 0.43544028           1 CD244 ulcerative colitis irritable bowel syndrome  GSE2461
3 -0.03501474 -0.12677636 0.05674688           1 CD244   nonlesional skin            lesional skin GSE27887
4  0.01096914 -0.08064105 0.10257932           1 CD244       pretreatment            posttreatment GSE27887
5 -0.03707265 -0.12407201 0.04992672           1 CD244         Infliximab         Before treatment GSE42296
6  0.07644834 -0.02849309 0.18138977           1 CD244          Responder             Nonresponder GSE42296

> dput(droplevels(head(tbl, 4)))
structure(list(Probe_gene = c("211828_s_at", "213107_at", "213109_at", 
"211828_s_at"), logFC = c(0.299038590078202, 0.110797898105632, 
0.183214738942169, -0.733505457149486), CI_L = c(-0.0332844208935414, 
-0.246475718463096, -0.103358698007331, -1.06488707237429), CI_R = c(0.631361601049945, 
0.46807151467436, 0.469788175891669, -0.402123841924678), AveExpr = c(7.38827278419383, 
7.83576862202959, 6.68411901305011, 7.38827278419383), t = c(2.08930195860002, 
0.720053829585981, 1.48442706763586, -5.13936340603241), P_Value = c(0.0714526369900392, 
0.492771856681782, 0.177447421180599, 0.000998740960213292), 
    adj_P_Value = c(1, 1, 1, 1), B = c(-4.07430683864883, -5.56181503167371, 
    -4.83144498851773, -0.294306065125513), gene = c("TNIK", 
    "TNIK", "TNIK", "TNIK"), Study_ID = c("GSE2461", "GSE2461", 
    "GSE2461", "GSE2461"), Group1 = c("Male", "Male", "Male", 
    "ulcerative colitis"), Group2 = c("Female", "Female", "Female", 
    "irritable bowel syndrome"), Study_ID = c("GSE2461", "GSE2461", 
    "GSE2461", "GSE2461"), Disease = c("irritable bowel syndrome; ulcerative colitis", 
    "irritable bowel syndrome; ulcerative colitis", "irritable bowel syndrome; ulcerative colitis", 
    "irritable bowel syndrome; ulcerative colitis"), DOID = c(9778L, 
    9778L, 9778L, 9778L), Title = c("Control (IBS) & Ulcerative colitis (UC) subjects", 
    "Control (IBS) & Ulcerative colitis (UC) subjects", "Control (IBS) & Ulcerative colitis (UC) subjects", 
    "Control (IBS) & Ulcerative colitis (UC) subjects"), GEO_Platform_ID = c("GPL96", 
    "GPL96", "GPL96", "GPL96"), Platform = c("Affymetrix Human U133A Array", 
    "Affymetrix Human U133A Array", "Affymetrix Human U133A Array", 
    "Affymetrix Human U133A Array"), PMID = c(0L, 0L, 0L, 0L), 
    Organism = c("Homo sapiens", "Homo sapiens", "Homo sapiens", 
    "Homo sapiens"), Data_Type = c("RNA", "RNA", "RNA", "RNA"
    ), Biomaterial = c("Colonic Mucosal biopsy", "Colonic Mucosal biopsy", 
    "Colonic Mucosal biopsy", "Colonic Mucosal biopsy"), Study_Type = c("in vivo", 
    "in vivo", "in vivo", "in vivo"), Samples = c(8L, 8L, 8L, 
    8L), Time_Point = c("Baseline", "Baseline", "Baseline", "Baseline"
    ), Treatment = c("NA", "NA", "NA", "NA"), Treatment_Protocol = c("NA", 
    "NA", "NA", "NA"), Raw_Data = c(0L, 0L, 0L, 0L), Notes = c("controls are IBS, not healty", 
    "controls are IBS, not healty", "controls are IBS, not healty", 
    "controls are IBS, not healty"), ylab = c("Female → Male", 
    "Female → Male", "Female → Male", "irritable bowel syndrome → ulcerative colitis"
    )), .Names = c("Probe_gene", "logFC", "CI_L", "CI_R", "AveExpr", 
"t", "P_Value", "adj_P_Value", "B", "gene", "Study_ID", "Group1", 
"Group2", "Study_ID", "Disease", "DOID", "Title", "GEO_Platform_ID", 
"Platform", "PMID", "Organism", "Data_Type", "Biomaterial", "Study_Type", 
"Samples", "Time_Point", "Treatment", "Treatment_Protocol", "Raw_Data", 
"Notes", "ylab"), row.names = c(NA, 4L), class = "data.frame")

I am using this to construct a plot that has the GSE # (Study_ID), followed by the contrast (Group1 vs Group2) on the y-axis, and logFC as the x-axis. I want to plot a horizontal line between each of the different GSE #'s for visual clarity, but my code doesn't seem to be working.

datasetList = tbl$Study_ID
hLines =(which(duplicated(datasetList) == FALSE) - 0.5)

tbl$ylab <- paste(tbl$Group2," \U2192 ", tbl$Group1, sep = "")


p <- ggplot(data = tbl, aes(x = logFC, y = Probe_gene, group = Study_ID)) +
  geom_point() +
  geom_vline(xintercept = log(0.5,2), size = 0.2) +
  geom_vline(xintercept = log(2/3,2), size = 0.2) +
  geom_vline(xintercept = log(1.5,2), size = 0.2) +
  geom_vline(xintercept = log(2,2), size = 0.2) +
  geom_hline(yintercept = hLines) +
  labs(title = tbl$gene, y = "Contrasts", x = bquote(~Log[2]~'(Fold Change)')) +
  geom_errorbarh(aes(x = logFC, xmin =  CI_L, xmax = CI_R), height = .1) +
  geom_point(aes(colour = cut(adj_P_Value, c(-Inf, 0.01, 0.05, Inf)))) +
  scale_color_manual(name = "P Value",
                     values = c("(-Inf,0.01]" = "red",
                                "(0.01,0.05)" = "orange",
                                "(0.05, Inf]" = "black"),
                     labels = c("<= 0.01", "0.01 < P Value <= 0.05", "> 0.05")) +
  #theme_bw()+
  theme(axis.text.y = element_blank(), strip.text.y = element_text(angle = 180), 
        panel.spacing.y = unit(0,'lines'), axis.ticks.y = element_blank()) +
  facet_grid(Study_ID+ylab~ ., scales = 'free', space = 'free', switch = 'both') 

p

For some reason with the code I have now, ggplot prints many more horizontal lines than I need. It is printing a line in between each GSE #, when I only need it to print a line in between the unique GSE #'s. What I am doing wrong? hLines contains the y-intercepts of where the lines should go.

P.S. As a bit of a side question, if anyone knows of a way for me to specify the shapes that appears (similar to how I specify the colors), that would be very appreciated. In reference to the colors, I need red circles, orange squares, and black crosses for the same conditions that appear in the scale_color_manual() function.

It would be much easier if you provided a reproducible example. — Roman Luštrik, Jul 26 '17 at 20:46
@RomanLuštrik This is pretty specific to the data that I provided/am using. I don't think I could reproduce this with something like `mtcars` or `diamonds` datasets — Kyle Weise, Jul 26 '17 at 20:49
There are more ways of providing the data. If you can't share a small subset of your own, you could simulate it. The MWE should get you thinking about your data more. — Roman Luštrik, Jul 26 '17 at 20:54
@RomanLuštrik I did share a small subset of my own data (?). How else would you like me to provide it? — Kyle Weise, Jul 26 '17 at 20:56
Have a look [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — Roman Luštrik, Jul 26 '17 at 20:56
For point shapes, you'll need to include the `shape` aesthetic in addition to the `color` aesthetic in your `geom_point` layer and then use `scale_shape_manual` to choose the shapes you want. — aosmith, Jul 26 '17 at 21:11
@aosmith Can you be a little more descriptive? I've added the `shape` aesthetic to my `geom_point` in the same fashion as I did the color, and used `scale_shape_manual` in exactly the same way I used `scale_color_manual` (except for changing the colors to the corresponding shape values), but it is not showing the behavior I would expect. Looks like it's plotting my desired shapes over ones that are already there(?). — Kyle Weise, Jul 27 '17 at 12:55
It looks like two points on top of each other because that's exactly what's. You have two `geom_point` layers, plotting the same data twice. If that's not what you want, just keep one of them and add `color` and `shape` to that layer. — aosmith, Jul 27 '17 at 14:19

Why ggplot2 geom_hlines plots more than intended?

0 Answers0