0

I have a dataset that shows the human chromosomes with their length (which is here named "value") and their respective genes. Furthermore the genes are divided in 4 groups (gtype) which are RNA, prot-coding, pseudogene and rest. I want to plot the individual chromosomes on the y axis and on the x axis I want to have the density of the chromosomes (which would be "the number of genes per chromosome" divided by the length of the chromosome)with a geom_point,bar or col, then make a facet_wrap based on the gtype of the genes. So 4 plots with one plot only counting RNA genes divided by the value, one plot ony counting the prot-coding genes divided by value.

A picture of a portion of the dataset

One of the codes I tried

Result of the code

The plot just divides the total number of all genes by the individual values (it is however normal that chrM would be the biggest)

However I constantly fail at the x axis and I don't know how to get a plot that makes sense. What I have tried so far is a mix of sum(), count(), nrow() and group_by(). Often the x axis is just the total numbers of rows divided by the "value" or the results don't make sense.

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
FreshyMC
  • 3
  • 1
  • Welcome to SO! To help us to help you could you please make your issue reproducible by sharing a sample of your **data**, the **code** you tried and the **packages** you used? See [how to make a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Please do not post an image of code/data/errors: If you want to post your data type `dput(NAME_OF_DATASET)` into the console and copy & paste the output starting with `structure(....` into your post or e.g. `dput(head(NAME_OF_DATASET, 20))` for the first 20 observations. – stefan Jun 03 '21 at 21:59

1 Answers1

0

If I understand what you're doing correctly, your data looks a bit like this:

set.seed(3425)
levs <- paste0("chr", c(1:10, "M"))
join_chr <- tibble(
    seqnames = sample(factor(levs, levels = levs), size = 1000, replace = TRUE), 
    gtype = sample(c("pseudogene", "RNA", "prot_coding", "rest"), size = 1000, replace = TRUE), 
    value = round(runif(n = 1000, min = 2e4, max = 2e8)))
join_chr
# # A tibble: 1,000 x 3
#   seqnames gtype           value
#   <fct>    <chr>           <dbl>
# 1 chr1     pseudogene   16170520
# 2 chr8     pseudogene  193230157
# 3 chr9     RNA           6846001
# 4 chr8     prot_coding  64930082
# 5 chr8     pseudogene   11873972
# 6 chr1     pseudogene  136993074
# 7 chr9     rest         53026355
# 8 chr6     prot_coding  36841130
# 9 chr5     prot_coding 157630684
# 10 chr10    prot_coding  29793808
# # … with 990 more rows

You can use summarise to calculate the density. Doing this as a separate step lets you check the summary to make sure it's behaving as you expect.

dens_chr <- join_chr %>% 
    group_by(seqnames, gtype) %>% 
    summarise(density = n() / max(value), .groups = "drop")

dens_chr
# # A tibble: 44 x 3
#    seqnames gtype            density
#    <fct>    <chr>              <dbl>
# 1  chr1     prot_coding 0.0000000872
# 2  chr1     pseudogene  0.000000154 
# 3  chr1     rest        0.000000162 
# 4  chr1     RNA         0.000000119 
# 5  chr2     prot_coding 0.000000110 
# 6  chr2     pseudogene  0.0000000833
# 7  chr2     rest        0.0000000893
# 8  chr2     RNA         0.000000143 
# 9  chr3     prot_coding 0.000000145 
# 10 chr3     pseudogene  0.000000126 
# # … with 34 more rows

You can then plot this.

ggplot(data = dens_chr, 
    mapping = aes(x = density, y = seqnames)) + 
    geom_bar(stat = "summary", fun = max) + 
    facet_grid(. ~ gtype) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Multiple barplots panelled by gene type

CSJCampbell
  • 2,025
  • 15
  • 19