0

I am trying to make a volcano plot for different clusters. I have 2 conditions, untreated vs. treated. I have a differential expression excel file that cellranger generated for me but within the file it has multiple clusters each which have a fold change and p value. How do I create a volcano plot that contains all the clusters rather than one? Would I have to do a volcano plot for each cluster and then combine them all somehow?

I used this code to generate the plot for just one of the clusters...

macrophage_list <- read.table("differential_expression_macrophage.csv", header = T, sep = ",")`

EnhancedVolcano(macrophage_list, lab = as.character(macrophage_list$FeatureName), x = 'Cluster1.Log2.Fold.Change', y = 'Cluster1.Adjusted.P.Value', xlim = c(-8,8), title = 'Macrophage', pCutoff = 10e-5, FCcutoff = 1.5, pointSize = 3.0, labSize = 3.0)

How do I merge all the information in the excel file to create a volcano plot?

I uploaded each data cluster one by one and then merged them by using rbind, but is there a simpler/quicker way to do this?

output for dput(gene_list[1:20, 1:14])

structure(list(Feature.ID = structure(1:20, .Label = c("a", "b", 
"c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", 
"p", "q", "r", "s", "t"), class = "factor"), Feature.Name = structure(1:20, .Label = c("A", 
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", 
"O", "P", "Q", "R", "S", "T"), class = "factor"), Cluster.1_Mean.Counts = c(0.000960904, 
0.000320301, 0.001281205, 0.000320301, 0.000320301, 0.016335362, 
0.000960904, 0, 0.001601506, 0.000320301, 0.007046627, 0.026585, 
0.017296265, 0.004804518, 0, 0.874742598, 0.017616566, 0.007366928, 
0.008327831, 0.001921807), Cluster.1_Log2.fold.change = c(0.291978774, 
1.954943787, -2.008530337, -2.482461526, 3.539906287, 0.407455991, 
-0.214981215, 1.539906287, 0.802940693, 2.539906287, -1.333136538, 
-1.879953595, -0.52422405, -0.877946228, 1.539906287, -0.629373147, 
1.118442519, 0.170672478, 1.065975099, 1.099333696), Cluster.1_Adjusted.p.value = c(1, 
0.910243711, 0.04672812, 0.080866038, 0.610296549, 0.80063597, 
1, 1, 0.951841603, 0.797013021, 0.103401275, 0.000594428, 0.907754993, 
0.532689631, 1, 0.480958806, 0.078345008, 1, 0.198557945, 0.668312142
), Cluster.2_Mean.Counts = c(0.000902278, 0.001804555, 0.006315943, 
0.004511388, 0, 0.029775159, 0.001804555, 0, 0.002706833, 0, 
0.023459216, 0.128123411, 0.030677437, 0.009022775, 0, 2.174488883, 
0.018947828, 0.019850106, 0.010827331, 0.000902278), Cluster.2_Log2.fold.change = c(0.792589781, 
4.769869705, 0.35201719, 0.839132367, 3.184907204, 1.32985554, 
0.962514783, 3.184907204, 1.725475586, 2.599944703, 0.560416339, 
0.580736324, 0.407299626, 0.184907204, 3.184907204, 0.816580902, 
1.120776867, 1.742684876, 1.409613491, 0.599944703), Cluster.2_Adjusted.p.value = c(1, 
0.153573448, 1, 0.737977734, 1, 0.14478935, 0.853816767, 1, 0.47952604, 
1, 0.65316285, 0.507251471, 0.776636022, 1, 1, 0.346630571, 0.285006452, 
0.060868933, 0.21546202, 1), Cluster.3_Mean.Counts = c(0.001813813, 
0, 0.019045032, 0.00725525, 0, 0.022672657, 0.000906906, 0, 0, 
0, 0.029927908, 0.043531502, 0.046252221, 0.029021001, 0, 3.146057931, 
0.020858845, 0.013603594, 0.008162157, 0), Cluster.3_Log2.fold.change = c(1.455721575, 
2.192687169, 2.008262598, 1.504631175, 3.192687169, 0.9044422, 
0.334706174, 3.192687169, -0.451169021, 2.607724668, 0.931421856, 
-1.032594057, 1.038258504, 1.970294748, 3.192687169, 1.412371018, 
1.26985503, 1.14829305, 0.991053308, -0.451169021), Cluster.3_Adjusted.p.value = c(0.757752635, 
1, 0.032609935, 0.33316083, 1, 0.441825712, 1, 1, 1, 1, 0.380305075, 
0.605158722, 0.339946318, 0.016952505, 1, 0.056529024, 0.259458704, 
0.339639234, 0.536765022, 1), Cluster.4_Mean.Counts = c(0.000641899, 
0, 0.002567596, 0.004493293, 0, 0.010270384, 0.003209495, 0, 
0.000641899, 0, 0.028243557, 0.160474756, 0.012196081, 0.005135192, 
0, 1.199709274, 0.005135192, 0.004493293, 0.005777091, 0.001283798
), Cluster.4_Log2.fold.change = c(0.269229783, 1.661547206, -0.886889419, 
0.778904157, 2.661547206, -0.289908942, 1.602653517, 2.661547206, 
0.076584705, 2.076584705, 0.854192284, 0.961549693, -0.967809414, 
-0.644261223, 2.661547206, -0.104384578, -0.790579612, -0.467735811, 
0.459913345, 0.722947751), Cluster.4_Adjusted.p.value = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.584036686, 1, 1, 1, 1, 1, 1, 
1, 1)), class = "data.frame", row.names = c(NA, 20L))
zx8754
  • 52,746
  • 12
  • 114
  • 209
Michelle
  • 11
  • 1
  • 4

1 Answers1

0

Based on your dataset, you need to reshape them but first in order to reshape them using the right pattern, we will rename some column names:

colnames(df) <- gsub(".Mean", "_Mean", colnames(df)) 
colnames(df) <- gsub(".Log2", "_Log2", colnames(df))
colnames(df) <- gsub(".Adjus","_Adjus",colnames(df))

Now, we can reshape it using the right pattern with pivot_longer function from tidyr package:

library(tidyr)
final_df <- df %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)")

# A tibble: 80 x 6
   Feature.ID Feature.Name set       Mean.Counts Log2.fold.change Adjusted.p.value
   <fct>      <fct>        <chr>           <dbl>            <dbl>            <dbl>
 1 a          A            Cluster.1    0.000961            0.292           1     
 2 a          A            Cluster.2    0.000902            0.793           1     
 3 a          A            Cluster.3    0.00181             1.46            0.758 
 4 a          A            Cluster.4    0.000642            0.269           1     
 5 b          B            Cluster.1    0.000320            1.95            0.910 
 6 b          B            Cluster.2    0.00180             4.77            0.154 
 7 b          B            Cluster.3    0                   2.19            1     
 8 b          B            Cluster.4    0                   1.66            1     
 9 c          C            Cluster.1    0.00128            -2.01            0.0467
10 c          C            Cluster.2    0.00632             0.352           1     
# … with 70 more rows

Now, we can create the volcano plot by using ggplot2 and ggrepel libraries for the labeling of Feature.Name (if you don't have ggrepel, you have to install it):

library(ggplot2)
library(ggrepel)
ggplot(final_df, aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), label = Feature.Name))+ 
  geom_point()+
  geom_text_repel(data = subset(final_df, Adjusted.p.value < 0.05), 
                  aes(label = Feature.Name))

And you get your volcano plot with all clusters merged, all points with the same color, and with labeling of Feature.names with an adjusted p value < 0.05

enter image description here

dc37
  • 15,840
  • 4
  • 15
  • 32
  • Hi dc37, this is exactly what I'm asking. For this following command, `File %>% pivot_longer(., - FeatureName, names_to = c("set",".value"), names_pattern = "(.+)_(.+)"`, am I substituting what is in "" and if so, to what? – Michelle Dec 16 '19 at 18:35
  • The easier way will be to rename first your columns names by using few lines that I post at the end of my answer (with `gsub(...)`. It should make your columns looks like my example and it will be easier to pass the command `pivot_longer` with the `names_pattern` without changing anything. Let me know if it is working. – dc37 Dec 16 '19 at 18:53
  • @Michelle, I edited my answer to provide a solution using your example data. Let me know if it is what you are looking for – dc37 Dec 16 '19 at 21:43
  • I followed all the steps but looking at the dataset after all the adjustments, the column for FeatureNames is gone and therefore I can't label based on gene but rather by cluster. How do I add the FeatureName column back? – Michelle Dec 18 '19 at 16:26
  • Sorry, I made a mistake in the very last line of code. Is it what you are talking about ? – dc37 Dec 18 '19 at 20:02
  • I ran this `EnhancedVolcano(df_reshaped, lab = as.character(df_reshaped$FeatureName), x = 'Log2.fold.change', y = 'Adjusted.p.value', xlim = c(-8,8), title = 'Macrophage', pCutoff = 10e-5, FCcutoff = 1.5, pointSize = 3.0, labSize = 3.0)` and got the following error: Error in `$<-.data.frame`(`*tmp*`, "lab", value = character(0)) : replacement has 0 rows, data has 657888 In addition: Warning message: Unknown or uninitialised column: 'FeatureName'. – Michelle Dec 19 '19 at 00:48
  • Feature_Name doesn't exist in the dataset anymore. When I ran `gene_list %>% pivot_longer(., -c(Feature_ID, Feature_Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)") %>% + ggplot(., aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), color = set))+ geom_point()` , this is the error I receive... Error: Unknown columns `a`, `b`, `c`, `d`, `e` and ... Run `rlang::last_error()` to see where the error occurred. – Michelle Dec 19 '19 at 00:49
  • can you post the output of `colnames(gene_list)[1:6]` just before lauching the command `%>% pivot ...`? – dc37 Dec 19 '19 at 02:09
  • [1] "Feature.ID" "Feature.Name" "Cluster.1_Mean.Counts" [4] "Cluster.1_Log2.fold.change" "Cluster.1_Adjusted.p.value" "Cluster.2_Mean.Counts" – Michelle Dec 19 '19 at 16:04
  • So, in my example, you should have replace `Feature_ID` and `Feature_Name` by the one you have in your dataframe `Feature.ID` and `Feature.Name`. I edited my answer with corresponding code. Something like that should work `gene_list %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)") %>% + ggplot(., aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), color = set))+ geom_point()` – dc37 Dec 20 '19 at 04:41
  • `> gene_list %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)") %>% + ggplot(., aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), color = set))+ geom_point() Error: Can't index beyond the end of a vector. The vector has length 0 and you've tried to subset element 1.` – Michelle Dec 21 '19 at 22:32
  • I went to go add the dataset again to start from scratch and this happened `Feature.ID = letters[1:20] > Feature.Name = LETTERS[1:20] > gene_list <- cbind(Feature.ID, Feature.Name, gene_list) Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 20, 31328` – Michelle Dec 21 '19 at 22:34
  • Why are you adding these two columns to your `gene_list` ? Does your `gene_list` dataset does not have already some ID and Name ? If I add them it was because your `dput` was too long to be put entirely in the question, so I re-create a fake dataset. But in your original dataset called `gene_list`, you should already have Feature.ID` and `Feature.Name` – dc37 Dec 22 '19 at 07:50
  • Awesome, I have generated the volcano plot with ggplot, but now how to I go about labeling with the FeatureName? – Michelle Dec 23 '19 at 14:37
  • I tried this... `gene_list %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)") %>% + ggplot(., aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), label = gene_list$Feature.Name, color = set))+ geom_point() Error: Aesthetics must be either length 1 or the same as the data (626560): label` to get the FeatureNames. – Michelle Dec 23 '19 at 17:44
  • Also, how to I get everything the same color? Thank you so much for all your help. – Michelle Dec 23 '19 at 18:02
  • I edited my answer accordingly. But if you still have more question, you should create a new question, the comments section is not the best way to ask for help. – dc37 Dec 24 '19 at 04:22
  • Sorry about that. – Michelle Dec 24 '19 at 21:11
  • I'm not sure what it is but the command is taking forever to produce the volcano plot. The first part goes quick but when I run everything as one, it's been taking forever. – Michelle Dec 24 '19 at 21:13
  • It depends of the size of your dataframe. If you have 20 000 differnt points to plot, it can be a bit long – dc37 Dec 24 '19 at 23:51