Problem with plotting an axis that consists of the number of observations by another column

Question

I have a dataset that shows the human chromosomes with their length (which is here named "value") and their respective genes. Furthermore the genes are divided in 4 groups (gtype) which are RNA, prot-coding, pseudogene and rest. I want to plot the individual chromosomes on the y axis and on the x axis I want to have the density of the chromosomes (which would be "the number of genes per chromosome" divided by the length of the chromosome)with a geom_point,bar or col, then make a facet_wrap based on the gtype of the genes. So 4 plots with one plot only counting RNA genes divided by the value, one plot ony counting the prot-coding genes divided by value.

A picture of a portion of the dataset

One of the codes I tried

Result of the code

The plot just divides the total number of all genes by the individual values (it is however normal that chrM would be the biggest)

However I constantly fail at the x axis and I don't know how to get a plot that makes sense. What I have tried so far is a mix of sum(), count(), nrow() and group_by(). Often the x axis is just the total numbers of rows divided by the "value" or the results don't make sense.

Welcome to SO! To help us to help you could you please make your issue reproducible by sharing a sample of your **data**, the **code** you tried and the **packages** you used? See [how to make a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Please do not post an image of code/data/errors: If you want to post your data type `dput(NAME_OF_DATASET)` into the console and copy & paste the output starting with `structure(....` into your post or e.g. `dput(head(NAME_OF_DATASET, 20))` for the first 20 observations. — stefan, Jun 03 '21 at 21:59

score 0 · Accepted Answer · answered Jun 03 '21 at 22:19

If I understand what you're doing correctly, your data looks a bit like this:

set.seed(3425)
levs <- paste0("chr", c(1:10, "M"))
join_chr <- tibble(
    seqnames = sample(factor(levs, levels = levs), size = 1000, replace = TRUE), 
    gtype = sample(c("pseudogene", "RNA", "prot_coding", "rest"), size = 1000, replace = TRUE), 
    value = round(runif(n = 1000, min = 2e4, max = 2e8)))
join_chr
# # A tibble: 1,000 x 3
#   seqnames gtype           value
#   <fct>    <chr>           <dbl>
# 1 chr1     pseudogene   16170520
# 2 chr8     pseudogene  193230157
# 3 chr9     RNA           6846001
# 4 chr8     prot_coding  64930082
# 5 chr8     pseudogene   11873972
# 6 chr1     pseudogene  136993074
# 7 chr9     rest         53026355
# 8 chr6     prot_coding  36841130
# 9 chr5     prot_coding 157630684
# 10 chr10    prot_coding  29793808
# # … with 990 more rows

You can use summarise to calculate the density. Doing this as a separate step lets you check the summary to make sure it's behaving as you expect.

dens_chr <- join_chr %>% 
    group_by(seqnames, gtype) %>% 
    summarise(density = n() / max(value), .groups = "drop")

dens_chr
# # A tibble: 44 x 3
#    seqnames gtype            density
#    <fct>    <chr>              <dbl>
# 1  chr1     prot_coding 0.0000000872
# 2  chr1     pseudogene  0.000000154 
# 3  chr1     rest        0.000000162 
# 4  chr1     RNA         0.000000119 
# 5  chr2     prot_coding 0.000000110 
# 6  chr2     pseudogene  0.0000000833
# 7  chr2     rest        0.0000000893
# 8  chr2     RNA         0.000000143 
# 9  chr3     prot_coding 0.000000145 
# 10 chr3     pseudogene  0.000000126 
# # … with 34 more rows

You can then plot this.

ggplot(data = dens_chr, 
    mapping = aes(x = density, y = seqnames)) + 
    geom_bar(stat = "summary", fun = max) + 
    facet_grid(. ~ gtype) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Problem with plotting an axis that consists of the number of observations by another column

1 Answers1