I am trying to get boxplots for 4 different genes with the expression data for each gene across multiple patients.

I've tried multiple ways and just keep hitting errors. I can do it using the base boxplot() function, but I can't figure it out in ggplot, and I spent hours yesterday reading other questions and answers without finding anything that helps. Most other examples have the data as 2 columns, so you can specify x = column a and y = column b. However, I want to plot all 4 columns of my entire df, and I couldn't find any help with that. I can do one gene at a time in ggplot but not all 4 together.

The data I have, BCON_sig_genes, is expression values (between 3 and 6) for 4 genes across 152 samples. The df is 152 obs. of 4 variables: each column is headed with a gene name, the row names are the sample IDs, and every cell is an expression value, as shown below.

         CD3E      LAT    ZAP70      LCK
1002 4.214679 5.652482 4.788204 5.393783
1022 4.424925 5.776641 4.864269 5.593587
8035 4.327270 5.725364 4.509920 4.961659
8037 4.415715 5.494048 4.435241 5.081846
9004 4.290078 5.265329 4.799106 5.275424
9005 4.233490 5.338098 4.666506 5.069394

The following code gets me one gene at a time, by substituting in the name of the gene.

BCON_sig_genes %>% ggplot(aes(y = CD3E, x = "CD3E"))+ geom_boxplot()

[Image: ggplot boxplot, 1 gene only]

I have tried gene <- colnames(BCON_sig_genes) and then inputting x = gene, but it doesn't work and comes up with the following error message:

Error: Aesthetics must be either length 1 or the same as the data (152): x

I think I need to sort out what y is. I tried leaving it blank so it would take all the data and sort it by column, but no luck.

I tried using the gather() function to make a key and a value column, but I couldn't quite figure it out without getting errors... though this felt like I was on the right track!

With the base function, all I have to do is boxplot(BCON_sig_genes) and it just plots all 4 genes on one graph with the correct values.

[Image: base function boxplot, all genes]

I think I need to wrangle the data better for ggplot so I can tell it that y is just all the expression values for each column but I'm not sure how.

Any help would be much appreciated!!

Thanks, Vicky

vzsmith
  • Can you provide a mock data frame and possibly an image of the graph you want to produce? – Kota Mori Jul 16 '20 at 11:32
  • Welcome to SO! As @KotaMori said, please provide a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, it sounds like you maybe want to store your data as [tidy data](https://r4ds.had.co.nz/tidy-data.html) – starja Jul 16 '20 at 11:40
  • Hi, I've edited the post now to show head of the data frame and the 2 boxplots I can currently produce! Hope that's a bit clearer now? Thanks – vzsmith Jul 16 '20 at 12:04

1 Answer

For ggplot to work, you need to get the data into long format: one column holding the gene names and a second column holding the expression values, so there is one row per sample per gene. You had the right idea with gather(), but gather() has been superseded by pivot_longer() in tidyr.
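To make "long format" concrete, here is a minimal sketch using only the six rows shown in the question and base R's stack(), so no packages are needed. The object names `BCON_head` and `long_head` are made up for illustration, and stack() is swapped in here only to show the shape of the result.

```r
# The six rows printed in the question; row names are the sample IDs.
BCON_head <- data.frame(
  CD3E  = c(4.214679, 4.424925, 4.327270, 4.415715, 4.290078, 4.233490),
  LAT   = c(5.652482, 5.776641, 5.725364, 5.494048, 5.265329, 5.338098),
  ZAP70 = c(4.788204, 4.864269, 4.509920, 4.435241, 4.799106, 4.666506),
  LCK   = c(5.393783, 5.593587, 4.961659, 5.081846, 5.275424, 5.069394),
  row.names = c("1002", "1022", "8035", "8037", "9004", "9005")
)

# stack() returns a data frame with two columns: `values` and `ind`.
long_head <- stack(BCON_head)
names(long_head) <- c("expression", "gene")

dim(long_head)      # 24 rows (6 samples x 4 genes) and 2 columns
head(long_head, 3)  # first three CD3E values with their gene label
```

Every expression value now sits in a single column, with the gene it belongs to recorded next to it, which is the layout that aes(x = gene, y = expression) expects.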

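library(tidyverse)

# BCON_sig_genes is the wide data frame from the question (one column per gene)
BCON_sig_genes %>%
  # stack the four gene columns into name/value pairs (long format)
  pivot_longer(cols = CD3E:LCK,
               names_to = "gene",
               values_to = "expression") %>%
  # one box per gene, all expression values on the y axis
  ggplot(aes(x = gene,
             y = expression)) +
  geom_boxplot()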
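Since the real data frame isn't posted in full, here is a self-contained sketch on simulated data of the same shape (152 rows, 4 gene columns, values between 3 and 6). The names `mock_genes`, `mock_long` and `p` are hypothetical, and the values are random numbers, not real expression data.

```r
library(tidyr)
library(ggplot2)

# Simulated stand-in for BCON_sig_genes: 152 samples x 4 genes.
set.seed(1)
mock_genes <- data.frame(
  CD3E  = runif(152, min = 3, max = 6),
  LAT   = runif(152, min = 3, max = 6),
  ZAP70 = runif(152, min = 3, max = 6),
  LCK   = runif(152, min = 3, max = 6)
)

# Wide to long: every column becomes (gene, expression) pairs.
mock_long <- pivot_longer(
  mock_genes,
  cols      = everything(),
  names_to  = "gene",
  values_to = "expression"
)

# One box per gene.
p <- ggplot(mock_long, aes(x = gene, y = expression)) +
  geom_boxplot()
print(p)
```

Using cols = everything() avoids typing the first and last gene names, which is handy when every column is a gene. Note that ggplot orders a character x axis alphabetically; if you want the original column order, convert gene to a factor first, e.g. factor(gene, levels = names(mock_genes)).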
NotThatKindODr