0

I have a scatterplot of a few thousand points faceted by a factor using facet_wrap. I have successfully split the y-axis using scale_y_break from the ggbreak package and now I would like to label a selection of my data points (named in a vector, top_g) using geom_text_repel from ggrepel. The problem is that ggrepel plots all my labels on the coordinate space both below the y-axis break as well as above. Since all of the points to label are above the break, I get a lot of labels in the lower half of the plot with long lines pointing up. What settungs in scale_y_break or geom_text_repel could help me to have the labels plotted only once, in the upper space of the plot?

Current picture: Faceted Scatterplot of Gene Expression, with y-axis break and labels

I have a data frame which looks like this:

          Gene Patient    Estimate   Pr(>|t|)          Color
A2M        A2M      13 -0.09364448 0.82082825 NS or FC < 0.5
A2M1       A2M      19 -0.60473507 0.06386751       FC > 0.5
A2M2       A2M      24  0.19832106 0.49231696 NS or FC < 0.5
A2M3       A2M      30  0.38843438 0.24560332 NS or FC < 0.5
A4GALT  A4GALT      13  0.45057750 0.07253149 NS or FC < 0.5
A4GALT1 A4GALT      19  0.22712101 0.17208382 NS or FC < 0.5
A4GALT2 A4GALT      24 -0.40367138 0.03971972       p < 0.05
A4GALT3 A4GALT      30  0.45064877 0.04509195       p < 0.05
AAAS      AAAS      13 -0.40079848 0.09217090 NS or FC < 0.5
AAAS1     AAAS      19  0.26819152 0.02358786       p < 0.05
AAAS2     AAAS      24  0.07455519 0.69811176 NS or FC < 0.5
AAAS3     AAAS      30  0.03123206 0.91063219 NS or FC < 0.5

The code I have used to create the plot is the following:

# Set parameters for the plot
fold_change_cutoff <- 0.5
volcano_colours <- c("grey","lightblue","#f0a3db","#8F54A0","#411945")
names(volcano_colours) <- levels(results_by_patient$Color)
top_g <- c("IGFBP3", "MT1X", "MT2A", "CTHRC1", "SLC25A37", "SFTPA1", "MOCOS", "IGHG2", "IGHG4", "IGHG3", "IDH3A", "IGHG1")
# Generate the plot
volcano_plot_overall <- ggplot(results_by_patient,
         aes(x = Estimate, y = -log10(`Pr(>|t|)`),
             color = Color, label = Gene)) +
      geom_vline(xintercept = c(fold_change_cutoff, -fold_change_cutoff), lty = "dashed") +
      geom_hline(yintercept = -log10(p_value_cutoff), lty = "dashed") +
      geom_point() +
      labs(x = "Enriched in TI <- log2(FC) -> Enriched in TC",
           y = bquote(-log[10]("p-value")),
           color = "Significance") +
      scale_color_manual(values = volcano_colours,
                         guide = guide_legend(override.aes = list(size = 4))) +
      scale_y_continuous(expand = expansion(mult = c(0,0.1))) +
      geom_text_repel(data = subset(results_by_patient, Gene %in% top_g & abs(Estimate) > fold_change_cutoff),# & Qvalue_Global < q_value_cutoff),
                      size = 4, point.padding = 0.15, color = "black",
                      min.segment.length = .1, box.padding = .2, 
                      max.overlaps = 50) +
      theme_classic(base_size = 12) +
      theme(legend.position = "bottom") + 
    facet_wrap(~Patient, nrow = 1) +
    ggbreak::scale_y_break(breaks = rep(-log10(p_value_cutoff),2),scales = 4, expand = FALSE)
# View the plot
  volcano_plot_overall

The vector top_g contains the names of the genes that I would like to plot.

> top_g
 [1] "IGFBP3"   "MT1X"     "MT2A"     "CTHRC1"   "SLC25A37" "SFTPA1"   "MOCOS"    "IGHG2"    "IGHG4"    "IGHG3"    "IDH3A"    "IGHG1"   

I tried a different way of drawing the labels, creating a column called Label with label strings for data points that I want to plot, but with NA values for genes that I don't want to plot. I also tried to remove the facet, but this did not help.

top_g_filter <- results_by_patient$Gene %in% top_g & abs(results_by_patient$Estimate) > fold_change_cutoff
  results_by_patient$Label[top_g_filter] <- results_by_patient[top_g_filter, "Gene"]
  volcano_plot_overall_2 <- ggplot(results_by_patient,
         aes(x = Estimate, y = -log10(`Pr(>|t|)`),
             color = Color, label = Label)) +
      geom_vline(xintercept = c(fold_change_cutoff, -fold_change_cutoff), lty = "dashed") +
      geom_hline(yintercept = -log10(p_value_cutoff), lty = "dashed") +
      geom_point() +
      labs(x = "Enriched in TI <- log2(FC) -> Enriched in TC",
           y = bquote(-log[10]("p-value")),
           color = "Significance") +
      scale_color_manual(values = volcano_colours,
                         guide = guide_legend(override.aes = list(size = 4))) +
      scale_y_continuous(expand = expansion(mult = c(0,0.1))) +
      geom_text_repel(size = 4, point.padding = 0.15, color = "black",
                      min.segment.length = .1, box.padding = .2, 
                      max.overlaps = 50) +
      theme_classic(base_size = 12) +
      theme(legend.position = "bottom") + 
    ggbreak::scale_y_break(breaks = rep(-log10(p_value_cutoff),2),scales = 4, expand = FALSE) 
  volcano_plot_overall_2

Same figure as above but without facets and with different labelling code

If anybody wants to import a sample of this data with 96 of the 37560 rows, here is a dump of the data frame (does not include the label column):

results_by_patient_for_export <-
structure(list(Gene = c("IGFBP3", "SFTPA1", "MT2A", "SLC25A37", 
"MOCOS", "CTHRC1", "IDH3A", "MT1X", "IGHG1", "IGHG2", "IGHG3", 
"IGHG4", "IGFBP3", "SFTPA1", "MT2A", "SLC25A37", "MOCOS", "CTHRC1", 
"IDH3A", "MT1X", "IGHG1", "IGHG2", "IGHG3", "IGHG4", "IGFBP3", 
"SFTPA1", "MT2A", "SLC25A37", "MOCOS", "CTHRC1", "IDH3A", "MT1X", 
"IGHG1", "IGHG2", "IGHG3", "IGHG4", "IGFBP3", "SFTPA1", "MT2A", 
"SLC25A37", "MOCOS", "CTHRC1", "IDH3A", "MT1X", "IGHG1", "IGHG2", 
"IGHG3", "IGHG4", "TPD52L2", "SPOP", "YAE1", "KIAA0319L", "HECTD4", 
"TRAPPC12", "GSTM2", "PRSS16", "VAMP3", "CXCL1", "RAB38", "MRPS17", 
"MROH6", "LETMD1", "PHC3", "CMTR1", "KHDRBS1", "GNB2", "RFNG", 
"BTRC", "ARHGAP25", "DPYD", "CNIH1", "SPIB", "PRKCZ", "POLK", 
"PIGQ", "PDIA4", "SLC6A8", "FOLR1", "MRPL53", "CUL9", "WDR25", 
"CCDC82", "DHX38", "TNFRSF21", "PPM1B", "RAVER1", "MTR", "ZC3H15", 
"MYH9", "ANKH", "SPARC", "CCND1", "TOLLIP", "TIA1", "PDCD10", 
"PPP2R5B"), Patient = c("13", "13", "13", "13", "13", "13", "13", 
"13", "13", "13", "13", "13", "19", "19", "19", "19", "19", "19", 
"19", "19", "19", "19", "19", "19", "24", "24", "24", "24", "24", 
"24", "24", "24", "24", "24", "24", "24", "30", "30", "30", "30", 
"30", "30", "30", "30", "30", "30", "30", "30", "19", "24", "13", 
"13", "19", "13", "19", "13", "13", "30", "30", "24", "24", "19", 
"24", "24", "19", "19", "24", "13", "24", "30", "13", "30", "19", 
"24", "24", "24", "19", "24", "19", "19", "24", "24", "24", "19", 
"19", "13", "24", "13", "30", "30", "30", "30", "30", "13", "19", 
"30"), Estimate = c(1.4342758701328928, -0.95936816185395002, 
1.3765890535912415, 0.60615370942926672, -1.010270997593212, 
0.56353243422236887, -0.90255562228792563, 1.5576954958648126, 
-1.3607499236262317, -1.4364118143364513, -1.474110922304231, 
-1.1250304191855622, 3.1276702547909232, -3.3143554695322655, 
0.9624251913317039, 0.5084592549167477, -0.54588374778088244, 
-0.13824110655637623, -0.6052417201126471, 3.7977091871394095, 
-0.71741871486351427, -0.68241455453487398, -0.68519031505256389, 
-0.67509190711461287, 0.7279818740896894, -0.8371731449517027, 
0.6621834292330987, 0.68601080309446616, -0.54741249344639331, 
1.0233680799126907, -0.76690550086857856, 0.53216320363185055, 
-0.84083755741626598, -0.91939098435321354, -0.97376403120535826, 
-0.88391863742066801, -0.51564041463475541, -0.74453358848207518, 
-0.28471451077109111, -0.22909521284674911, -0.044770214916977628, 
0.68239635866331572, 0.15577084253033693, -0.49540662383622563, 
0.098549027264916766, -0.35613186894233162, -0.38333940223368979, 
-0.43088746468894373, 0.5351162057034351, 0.027112990337612539, 
0.15195259496679489, -0.3962375835799139, 0.25002798235856544, 
-0.07057695299871225, 0.79863171699545554, 0.29513008288664477, 
0.57201674855370344, -0.50089052176041227, -0.26468319819506236, 
-0.0081148038621813127, -0.2336531701309047, 0.17000723874768747, 
0.14010918425432589, 0.18737454938419987, 0.44672531215461098, 
-0.033362956135194674, 0.12096884904813542, 0.081772635248707412, 
-0.45350911420250539, 0.22080135450764451, 0.34743885824956194, 
-0.2032755327809011, 0.92009563303147257, 0.022706998886684836, 
-0.25977610831654502, 0.015140050702365676, 1.1464018091062771, 
-0.038828036472541648, -0.15644114183766625, -0.074307333870867423, 
-0.099945432970827974, -0.20100761722591759, 0.15797441169132551, 
0.086752990476589312, -0.11579339896376119, 0.14896415466954385, 
0.50707961300974613, -0.034268211871531895, -0.03941067650003869, 
-0.30431222659251722, 0.64079275967329175, 0.078296987301706519, 
-0.043445685667455242, 0.32461627444485458, 0.095282428367853819, 
-0.24711741016959532), `Pr(>|t|)` = c(0.00054521644956236591, 
0.20844924924849634, 0.0063344881708467032, 0.04690839856354196, 
0.010426543004723813, 0.015327690300849395, 0.00012672535286031001, 
0.0014902029039675733, 0.00032902060622238736, 0.0095163818112773899, 
0.00041057255262061366, 0.0025832008904889104, 3.9052657243800136e-08, 
0.00031557989822396073, 0.0016258358307589705, 0.033893440018941505, 
0.048477691960592727, 0.74798603744403425, 0.0027477084430068809, 
1.3029037134854807e-10, 0.0076116608175266203, 0.0048210474122679084, 
0.0015824143193650824, 0.00068984318394415341, 0.031253983612796923, 
0.045503059810359778, 0.023560209387624528, 0.0021506430320710149, 
0.039887537832338241, 0.012514059956998468, 0.0079174092591247601, 
0.044102721831279078, 0.034534589154438501, 0.017662871943542321, 
0.014178542803463335, 0.018739918520330426, 0.21531375078386988, 
0.007547345201378057, 0.34664827014402366, 0.23592124928581937, 
0.86675590750088505, 0.010740914956562011, 0.37538269351680009, 
0.07326875438459321, 0.7710526024847264, 0.19876854208969691, 
0.2700180593238321, 0.088821455238771108, 5.4060422418354054e-05, 
0.85902012019183616, 0.51278621062991048, 0.068238170943737353, 
0.1067114019857604, 0.54185622331471206, 0.00045391342317257377, 
0.42458757033517591, 0.0031351261383892148, 0.066183649693906679, 
0.26831300173960709, 0.97624479310207934, 0.32802963857025325, 
0.075725490459696815, 0.39453879747887632, 0.25816494412893093, 
1.0744175303137345e-05, 0.69930544875517775, 0.38565565156662995, 
0.56501752365592162, 0.13302284729310845, 0.30997459777531461, 
0.016620932938517161, 0.60853787428757955, 8.6202581176759562e-05, 
0.90431267771378143, 0.35611195951734775, 0.8957129076558038, 
1.901545313467059e-07, 0.80795484621207692, 0.31534016133786252, 
0.66480293941954949, 0.69897965708479459, 0.2190571720859576, 
0.24508617068981289, 0.46460451373712663, 0.48100310611514163, 
0.51297903648061927, 0.061352074933778943, 0.79036877023581065, 
0.69208351526084511, 0.30645914773081245, 0.10527876566915252, 
0.66233329772369676, 0.7606641149420309, 0.080496301712866342, 
0.57695731467415023, 0.21638928987345277), Color = structure(c(5L, 
2L, 5L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 
1L, 5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 5L, 4L, 4L, 5L, 4L, 4L, 
4L, 4L, 4L, 2L, 5L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 
1L, 1L, 1L, 1L, 1L, 5L, 1L, 5L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
1L, 1L, 1L, 1L, 1L, 3L, 1L, 5L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L), levels = c("NS or FC < 0.5", 
"FC > 0.5", "p < 0.05", "FC > 0.5 & p < 0.05", "FC > 0.5 & FDR < 0.1"
), class = "factor")), row.names = c("IGFBP3", "SFTPA1", "MT2A", 
"SLC25A37", "MOCOS", "CTHRC1", "IDH3A", "MT1X", "IGHG1", "IGHG2", 
"IGHG3", "IGHG4", "IGFBP31", "SFTPA11", "MT2A1", "SLC25A371", 
"MOCOS1", "CTHRC11", "IDH3A1", "MT1X1", "IGHG11", "IGHG21", "IGHG31", 
"IGHG41", "IGFBP32", "SFTPA12", "MT2A2", "SLC25A372", "MOCOS2", 
"CTHRC12", "IDH3A2", "MT1X2", "IGHG12", "IGHG22", "IGHG32", "IGHG42", 
"IGFBP33", "SFTPA13", "MT2A3", "SLC25A373", "MOCOS3", "CTHRC13", 
"IDH3A3", "MT1X3", "IGHG13", "IGHG23", "IGHG33", "IGHG43", "TPD52L21", 
"SPOP2", "YAE1", "KIAA0319L", "HECTD41", "TRAPPC12", "GSTM21", 
"PRSS16", "VAMP3", "CXCL15", "RAB383", "MRPS172", "MROH62", "LETMD11", 
"PHC32", "CMTR12", "KHDRBS11", "GNB21", "RFNG2", "BTRC", "ARHGAP252", 
"DPYD3", "CNIH1", "SPIB3", "PRKCZ1", "POLK2", "PIGQ2", "PDIA42", 
"SLC6A81", "FOLR12", "MRPL531", "CUL91", "WDR252", "CCDC822", 
"DHX382", "TNFRSF211", "PPM1B1", "RAVER1", "MTR2", "ZC3H15", 
"MYH93", "ANKH3", "SPARC3", "CCND13", "TOLLIP3", "TIA1", "PDCD101", 
"PPP2R5B3"), class = "data.frame")

Thank you in advance for your help, and let me know if I can provide any further information!

  • Instead of trying to find a solution for your problem, I'm going to suggest that doing away with the y-axis break solves your problem and makes the plot better. I'm familiar enough with volcano plots in genomics to know that a broken axis isn't a convention, and it isn't really helping the interpretation in my view. – teunbrand Jul 27 '23 at 10:29
  • I agree that axis breaks do not help in this type of plot. My packages 'ggpp' and 'ggpmisc' provide some statistics and scales that make producing volcano plots easier. By the way, the labels would be much less crowded if you used small text. I have [a gallery of examples](https://www.r4photobiology.info/galleries/quadrant-volcano-plots.html#volcano-plots) including for volcano and quadrant plots built with ggplot2. – Pedro J. Aphalo Aug 22 '23 at 09:33
  • Thank you both for the advice, @teunbrand as well. I see the issues with introducing an axis break to a volcano plot. However, I have other plots (bar plots of cell abundance) with y-axis breaks and the percentage values for each bar made with geom_text appear both under and above the break. Any suggestions on how this can be suppressed? – Gustav Christensson Aug 25 '23 at 13:00

0 Answers0