
I'm trying to draw a stacked bar plot with ggplot2 using this code:

barplot <- ggplot() + geom_bar(aes(y = percentage, x = TBD, fill = TBD), data = charts.data, stat="identity")

I want to create a bar plot for my single-cell analysis, which has two conditions: treated and untreated. The plot should show the percentage of each cell type per condition, so I can see whether the treatment has an effect on the different cell types.

How do I determine the percentage of each cell type in each condition, and then plot the result as a stacked bar plot?

output of dput(head(comparison))

structure(c(6051L, 1892L, 1133L, 893L, 148L, 868L, 5331L, 3757L, 
1802L, 1061L, 2786L, 704L), .Dim = c(6L, 2L), .Dimnames = structure(list(c("Fibroblast", "T cell", "Macrophage", "Stellate", "Acinar", "Endothelial"), c("treated", "untreated")), .Names = c("", 
"")), class = "table")

output of dput(head(cell_cycle_data))

structure(list(orig.ident = c("treated", "treated", "treated", 
    "treated", "treated", "treated"), nCount_RNA = c(1892, 307, 1348, 
    3699, 4205, 4468), nFeature_RNA = c(960L, 243L, 765L, 1612L, 
    1341L, 1644L), percent.mt = c(0.211416490486258, 1.62866449511401, 
    4.45103857566766, 4.4065963773993, 0.0713436385255648, 3.87197851387645
    ), RNA_snn_res.0.5 = structure(c(11L, 11L, 5L, 6L, 11L, 13L), .Label = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
    "13", "14", "15", "16", "17", "18", "19"), class = "factor"), seurat_clusters = structure(c(11L, 11L, 5L, 6L, 11L, 13L), .Label = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"), class = "factor"), S.Score = c(0.476893835992198, -0.0200784617568548, -0.0335915198305002, -0.0247184276246385, 0.010785196602457, 0.0190008903712199), G2M.Score = c(0.204441469200986, 0.173804859670862, -0.0313235510969097, -0.0376796363661889, -0.0559526905696905, -0.0122031631356698), Phase = structure(c(3L, 2L, 1L, 1L, 3L, 3L), .Label = c("G1", "G2M", "S"), class = "factor"), old.ident = structure(c(7L,7L, 1L, 4L, 7L, 9L), .Label = c("Fibroblast", "T cell", "Macrophage", "Stellate", "Acinar", "Endothelial", "Tumor", "B cell", "Mast cell", "Ductal", "Islets of Langerhans"), class = "factor")), row.names = c("treated_AAACGCTAGCGGGTTA-1", "treated_AAAGGTAAGTACAGAT-1", "treated_AAAGTGAGTTTGATCG-1", "treated_AAATGGACAAAGTGTA-1", 
    "treated_AACAAAGGTCGACTTA-1", "treated_AACAGGGTCCTAGCCT-1"), class = "data.frame")

output of dput(tail(cell_cycle_data))

structure(list(orig.ident = c("untreated", "untreated", "untreated", 
"untreated", "untreated", "untreated"), nCount_RNA = c(901, 823, 
1184, 1835, 1147, 1407), nFeature_RNA = c(482L, 479L, 649L, 1043L, 
604L, 709L), percent.mt = c(1.77580466148724, 2.91616038882138, 
4.22297297297297, 3.86920980926431, 2.0052310374891, 4.05117270788913
), RNA_snn_res.0.5 = structure(c(7L, 7L, 7L, 14L, 7L, 7L), .Label = c("0", 
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
"13", "14", "15", "16", "17", "18", "19"), class = "factor"), 
    seurat_clusters = structure(c(7L, 7L, 7L, 14L, 7L, 7L), .Label = c("0", 
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", 
    "12", "13", "14", "15", "16", "17", "18", "19"), class = "factor"), 
    S.Score = c(-0.0320858200243315, 0.0304725660342869, 0.0215996091745327, 
    0.0384166213301423, 0.144956251122548, -0.0242770509986111
    ), G2M.Score = c(0.0904224391544142, 0.050148242050667, -0.0178041670730754, 
    -0.0112596867977946, -0.0519554524339088, -0.0136533184257381
    ), Phase = structure(c(2L, 2L, 3L, 3L, 3L, 1L), .Label = c("G1", 
    "G2M", "S"), class = "factor"), old.ident = structure(c(5L, 
    5L, 5L, 5L, 5L, 5L), .Label = c("Fibroblast", "T cell", "Macrophage", 
    "Stellate", "Acinar", "Endothelial", "Tumor", "B cell", "Mast cell", 
    "Ductal", "Islets of Langerhans"), class = "factor")), row.names = c("untreated_TTTGGTTGTCTAATCG-18", 
"untreated_TTTGGTTTCCCGAGGT-18", "untreated_TTTGTTGAGAACTGAT-18", 
"untreated_TTTGTTGAGCTCGGCT-18", "untreated_TTTGTTGAGTGCCTCG-18", 
"untreated_TTTGTTGCACGGTGCT-18"), class = "data.frame")
Michelle
  • It will be easier to answer your question if you provide a reproducible example of what your data looks like. See: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – dc37 Jan 11 '20 at 23:29
  • I don't have much data to show besides the gene lists, which aren't characterized by cell types. – Michelle Jan 12 '20 at 01:49
  • But how do you want to get your plot if you don't know the cell types and conditions? – dc37 Jan 12 '20 at 01:51
  • My conditions are treated and untreated. I have genes and I've classified the major cell types but don't have every single gene classified. I don't know how to determine the percentage of each. – Michelle Jan 12 '20 at 02:12
  • So you have a list of genes and associated cell types? You should provide this as an example so that we can understand what your output looks like in the treated and untreated conditions. – dc37 Jan 12 '20 at 02:15
  • Yes, but only for the top markers that I've found. – Michelle Jan 12 '20 at 02:24
  • Sorry, there is still a lot of information missing before we can help you. If you can find a way to share the definition of at least two cell types, as well as the output of your single-cell analysis for the treated and untreated conditions, maybe we can assist you with that. – dc37 Jan 12 '20 at 02:36
  • I guess my question would be, how do I go about determining the percentages? The same cell types appear in each condition, but the size of each cluster varies with the condition. – Michelle Jan 12 '20 at 02:45
  • Based on your data, what do cluster 1, 2, 3 ... correspond to? What do the feature names correspond to? How are you defining that this is the `macrophage` cluster? – dc37 Jan 15 '20 at 00:21
  • Within loupe browser for single cell analysis, I reanalyzed the macrophage cluster from the aggregate. I used known markers to identify that cluster as macrophage. Would just an up-regulated list of gene be better? – Michelle Jan 15 '20 at 00:28
  • That does not help me understand what cluster 1 / cluster 2 / cluster 3, feature.id and feature.name are. Can you define those? – dc37 Jan 15 '20 at 00:30
  • When I reanalyzed the cluster, it generates an aggregate with different number of clusters, each defined in the excel sheet. Feature name is the gene name and feature id can be ignored. – Michelle Jan 15 '20 at 00:32
  • So, basically, each gene name is associated with a logFC value and a p-value that depend on the cluster, am I right? So, for example, cluster 1 will be the macrophage cluster. However, to answer your question, your program does not return the number of cells in each cluster, so I don't see how you can get the percentage you are looking for. – dc37 Jan 15 '20 at 00:34
  • That is correct, but that whole file is for the macrophage cluster. And no, it doesn't, but is there still a way to do a stacked bar graph with that file? – Michelle Jan 15 '20 at 00:36
  • OK, so basically you have one similar file for each "main" cluster? In that case, does it mean that the columns "cluster1", "cluster2" of the macrophage cluster are "cell1", "cell2" of this cluster? – dc37 Jan 15 '20 at 00:42
  • What do you mean by cell1, cell2? – Michelle Jan 15 '20 at 00:47
  • In your example, there are 4 clusters defined in the macrophage cluster file. Do they designate a cluster within the macrophage cluster, or a single cell in the macrophage cluster? – dc37 Jan 15 '20 at 00:49
  • A cluster within the macrophage cluster. – Michelle Jan 15 '20 at 00:55
  • I see. I'm afraid that, based on this file, you have no way to know the number of cells in each cluster or sub-cluster. I think you need to look in the parameters of the Loupe Browser for single-cell analysis to see if you can get these numbers. Sorry. – dc37 Jan 15 '20 at 00:58

1 Answer


Without knowing the structure of your data, it's hard to guess what the right code for your case would be.

However, let's assume that for each condition you have a list of individual cells, each labelled with its cell type, as in the following example:

set.seed(123)
Untreated <- data.frame(Cell_Type = sample(LETTERS[1:4],10, replace = TRUE))
Treated <- data.frame(Cell_Type =sample(LETTERS[1:4],25, replace = TRUE))

  Cell_Type
1         C
2         C
3         C
4         B
5         C
6         B
...       ...

You can use dplyr to first label each data frame with its condition and then combine them with bind_rows:

library(dplyr)
Untreated <- Untreated %>% mutate(Condition = "Untreated")
Treated <- Treated %>% mutate(Condition = "Treated")
DF <- bind_rows(Untreated, Treated)

  Cell_Type Condition
1         C Untreated
2         C Untreated
3         C Untreated
4         B Untreated
5         C Untreated
6         B Untreated

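Then, you can count the number of cells of each type in each condition and express it as a percentage of that condition's total:

DF <- DF %>% group_by(Condition, Cell_Type) %>% 
  summarise(Nb = n()) %>%       # cells per condition and cell type
  mutate(C = sum(Nb)) %>%       # total per condition (summarise() dropped the Cell_Type grouping)
  mutate(percent = Nb/C*100)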

# A tibble: 7 x 5
# Groups:   Condition [2]
  Condition Cell_Type    Nb     C percent
  <chr>     <chr>     <int> <int>   <dbl>
1 Treated   A             7    25     28.
2 Treated   B             7    25     28.
3 Treated   C             6    25     24 
4 Treated   D             5    25     20 
5 Untreated A             1    10     10 
6 Untreated B             4    10     40 
7 Untreated C             5    10     50 
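
The same percentages can also be computed with count() and an explicit group_by(), which avoids relying on the grouping that summarise() leaves behind. This is only a minimal sketch on the same kind of toy data as above; the names cells and pct are illustrative and not part of the original code:

```r
library(dplyr)

# Toy data as above: one row per cell, labelled with its type and condition
set.seed(123)
cells <- bind_rows(
  data.frame(Cell_Type = sample(LETTERS[1:4], 10, replace = TRUE),
             Condition = "Untreated"),
  data.frame(Cell_Type = sample(LETTERS[1:4], 25, replace = TRUE),
             Condition = "Treated")
)

pct <- cells %>%
  count(Condition, Cell_Type, name = "Nb") %>%   # cells per condition and cell type
  group_by(Condition) %>%                        # percentages are relative to each condition
  mutate(percent = Nb / sum(Nb) * 100) %>%
  ungroup()

# Sanity check: the percentages should add up to 100 within each condition
pct %>% group_by(Condition) %>% summarise(total = sum(percent))
```

Ending with ungroup() keeps later steps from silently operating per condition.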

Then, you can plot the results as a stacked bar chart with one bar per condition, filled according to Cell_Type:

library(ggplot2)
ggplot(DF, aes(x = Condition, y = percent, fill = Cell_Type))+
  geom_bar(stat = "identity")+
  geom_text(aes(label = paste(percent,"%")), position = position_stack(vjust = 0.5))

[Figure: stacked bar chart of cell-type percentages for the Treated and Untreated conditions]
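
If you only have raw counts, ggplot2 can also do the normalisation itself with position = "fill", which rescales each bar to sum to 1. The sketch below assumes a small hypothetical data frame counts (illustrative values, not your data) and uses scales::percent to label the axis:

```r
library(ggplot2)

# Hypothetical counts per cell type and condition (illustrative values only)
counts <- data.frame(
  Condition = rep(c("Treated", "Untreated"), each = 3),
  Cell_Type = rep(c("A", "B", "C"), times = 2),
  Nb        = c(7, 7, 11, 1, 4, 5)
)

p <- ggplot(counts, aes(x = Condition, y = Nb, fill = Cell_Type)) +
  geom_col(position = "fill") +                   # each bar is rescaled to sum to 1
  scale_y_continuous(labels = scales::percent) +  # show the 0-1 scale as 0%-100%
  labs(y = "Share of cells")
p
```

geom_col() is equivalent to geom_bar(stat = "identity"). Computing the percentages yourself, as above, is still the better choice if you want to print them as labels.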

EDIT: Plotting using data provided by the OP

Using the data you provided in your question, you can do:

df <- structure(c(6051L, 1892L, 1133L, 893L, 148L, 868L,
                  5331L, 3757L, 1802L, 1061L, 2786L, 704L),
                .Dim = c(6L, 2L),
                .Dimnames = structure(list(c("Fibroblast", "T cell", "Macrophage",
                                             "Stellate", "Acinar", "Endothelial"),
                                           c("treated", "untreated")),
                                      .Names = c("", "")),
                class = "table")
df <- data.frame(df)

This gives the following data frame:

          Var1      Var2 Freq
1   Fibroblast   treated 6051
2       T cell   treated 1892
3   Macrophage   treated 1133
4     Stellate   treated  893
5       Acinar   treated  148
6  Endothelial   treated  868
7   Fibroblast untreated 5331
8       T cell untreated 3757
9   Macrophage untreated 1802
10    Stellate untreated 1061
11      Acinar untreated 2786
12 Endothelial untreated  704

Then, you can rename the columns and calculate the percentage of each cell type within each condition:

library(dplyr)
DF <- df %>% rename(Cell_Type = Var1, Condition = Var2) %>%
  group_by(Condition) %>% 
  mutate(Percent = Freq / sum(Freq)*100)

# A tibble: 12 x 4
# Groups:   Condition [2]
   Cell_Type   Condition  Freq Percent
   <fct>       <fct>     <int>   <dbl>
 1 Fibroblast  treated    6051   55.1 
 2 T cell      treated    1892   17.2 
 3 Macrophage  treated    1133   10.3 
 4 Stellate    treated     893    8.13
 5 Acinar      treated     148    1.35
 6 Endothelial treated     868    7.90
 7 Fibroblast  untreated  5331   34.5 
 8 T cell      untreated  3757   24.3 
 9 Macrophage  untreated  1802   11.7 
10 Stellate    untreated  1061    6.87
11 Acinar      untreated  2786   18.0 
12 Endothelial untreated   704    4.56
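
As a cross-check that needs no extra packages, base R's prop.table() gives the same percentages directly from the contingency table. This is a sketch that rebuilds the table from the counts in the question; the names tab, pct_tab and pct_long are illustrative:

```r
# Contingency table of cell counts from the question
# (rows = cell type, columns = condition)
tab <- as.table(matrix(
  c(6051, 1892, 1133, 893, 148, 868,
    5331, 3757, 1802, 1061, 2786, 704),
  ncol = 2,
  dimnames = list(
    c("Fibroblast", "T cell", "Macrophage", "Stellate", "Acinar", "Endothelial"),
    c("treated", "untreated")
  )
))

# margin = 2 normalises within each column, i.e. within each condition
pct_tab <- prop.table(tab, margin = 2) * 100
round(pct_tab, 2)

# Long format for ggplot(): columns Var1 (cell type), Var2 (condition), Percent
pct_long <- as.data.frame(pct_tab, responseName = "Percent")
```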

Then, for the plotting part:

library(ggplot2)
ggplot(DF, aes(x = Condition, y = Percent, fill = Cell_Type))+
  geom_bar(stat = "identity")+
  geom_text(aes(label = paste(round(Percent,2),"%")), position = position_stack(vjust =  0.5))

[Figure: stacked bar chart of cell-type percentages for the treated and untreated conditions, using the OP's data]
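
For the follow-up about cell cycle phase per cell type within each condition, the same count-then-normalise idea should carry over by grouping on two variables and faceting by condition. The sketch below is an assumption about what you want, not a tested solution on your full data: it uses only the 12 cells visible in the head()/tail() output of cell_cycle_data, reduced to the columns orig.ident, old.ident and Phase. With the real object you would pass cell_cycle_data itself instead of the illustrative cells data frame:

```r
library(dplyr)
library(ggplot2)

# The 12 cells shown in the question's head()/tail() output, three columns only
cells <- data.frame(
  orig.ident = rep(c("treated", "untreated"), each = 6),
  old.ident  = c("Tumor", "Tumor", "Fibroblast", "Stellate", "Tumor", "Mast cell",
                 rep("Acinar", 6)),
  Phase      = c("S", "G2M", "G1", "G1", "S", "S",
                 "G2M", "G2M", "S", "S", "S", "G1")
)

phase_pct <- cells %>%
  count(orig.ident, old.ident, Phase, name = "Nb") %>%
  group_by(orig.ident, old.ident) %>%          # normalise within condition and cell type
  mutate(Percent = Nb / sum(Nb) * 100) %>%
  ungroup()

p <- ggplot(phase_pct, aes(x = old.ident, y = Percent, fill = Phase)) +
  geom_col() +
  facet_wrap(~ orig.ident, scales = "free_x") +  # one panel per condition
  labs(x = "Cell type", y = "Percent of cells")
p
```

Each bar is one cell type, stacked by phase, so every bar sums to 100% and the two panels can be compared side by side.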

Does this answer your question?

dc37
  • I have a differential expression file for each condition for each specific cell type, would that work as an input? – Michelle Jan 14 '20 at 18:23
  • No, you will have to join these files first. I'm sorry but without showing a small example of these files, I can't guide you through that – dc37 Jan 14 '20 at 19:10
  • Maybe you can post the first lines of the differential expression file obtained for each condition for a single cell type (use `dput(head(df))` to generate the example). – dc37 Jan 14 '20 at 20:59
  • I added the differential expression file for the macrophage cluster for untreated and treated. – Michelle Jan 15 '20 at 00:12
  • Within Seurat, I was able to get the percentage for each cluster and condition. Do you know how to export that file and then do the stacked bar graph? – Michelle Jan 15 '20 at 22:09
  • I never used Seurat. Maybe you can export as `csv` or `txt` files. If so, you can then load into R and provide a small portion of this file into your question. Then, we should be able to guide you for the stacked bar graph. – dc37 Jan 15 '20 at 22:18
  • I have updated my question with a portion of the file. – Michelle Jan 20 '20 at 01:56
  • Yes, it does. I now want to create another stacked bar graph, this time showing the cell cycle phases within each cell type. I have uploaded the file for it. I don't know how to go about it, as I now have another variable (cell cycle). I was able to do the bar graph comparing cell cycle per condition, but now I want to compare cell type and cell cycle in each condition. – Michelle Jan 20 '20 at 22:46
  • Don't know if more of the data frame is needed, as I only did the head of it. – Michelle Jan 20 '20 at 22:46
  • Moreover, having a long exchange in the comments is not really desirable, as it makes reading difficult for visitors looking for a solution to a similar problem. – dc37 Jan 20 '20 at 22:56
  • I posted the tail end of the data frame. – Michelle Jan 20 '20 at 23:18