Long vector-plot/Coverage plot in R

Question

I really need your R skills here. Been working with this plot for several days now. I'm a R newbie, so that might explain it.

I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far: enter image description here

Looks alright, but I'm missing y-labels so I can tell which chromosome it is, and also I've been having trouble modifying the x-axis, so it ends where the coverage ends. Additionally, my own data is much much bigger, making this plot in particular take extremely long time. Which is why I tried this HilbertVis plotLongVector. It works but I can't figure out how to modify it, the x-axis, the labels, how to make the y-axis logged, and the vectors all get the same length on the plot even though they are not equally long.

source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07)) 
chr2 <- abs(makeRandomTestData(len=1e+07)) 

par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)

# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
  plot.new()
  plot.window(c(start, length(chr1)), c(0, 10))
  axis(1, labels=F) 
  axis(4)
  lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.

# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
  plot.new()
  plot.window(c(start, length(chr1)), c(0, 10))
  axis(1, labels=F) 
  axis(4)
  lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.


# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")

Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.

genes_chr1 <- abs(makeRandomTestData(len=1.3e+07)) 
genes_chr2 <- abs(makeRandomTestData(len=1e+07))

These gene vectors I would like plotted as a red dot under the chromosomes! Basically, if the vector has a value there (>0), it is presented as a dot (or line) under the long vector plot. This I have not idea how to add! But it seems fairly straightforward.

Please help me! Thank you so much.

It would be nice if you could whip up a reproducible example. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — Roman Luštrik, Jan 31 '13 at 15:51
For long data you probably want to bin it first, e.g., to the resolution of the screen. The Bioconductor packages [ggbio](http://bioconductor.org/packages/2.11/bioc/html/ggbio.html) and [Gviz](http://bioconductor.org/packages/2.11/bioc/html/Gviz.html) might be useful here. — Martin Morgan, Jan 31 '13 at 15:59
Hi Roman. I tried to make the data reproducible with the makeRandomTestData() function from the HilbertVis package! I am not aware of any other randomdata-function that creates similar random data. Is it not working? — user1678000, Jan 31 '13 at 16:15
I added the biocLite() bioconductor download function at the top, perhaps that was what you were asking for! — user1678000, Jan 31 '13 at 16:18
Thank you Martin, I will look into them. ggbio looks very interesting! Very nice manual. — user1678000, Jan 31 '13 at 16:22
@MartinMorgan Dear Martin, I could not get ggbio to work. I tried with normal version and devel version. The very first plot in the manual gives this error: autoplot(gr) Error in rep.int("", ncol(r)) : incorrect type for second argumen — user1678000, Feb 01 '13 at 14:07

Arun · Answer 1 · 2013-02-01T11:14:55.977

5

DISCLAIMER: Please do not simply copy and paste this code to run off the entire positions of your chromosome. Please sample positions (for example, as @Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many many hours, if your computer survives the drain.

Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 300 positions, just to show an example. You can build up on this, I hope.

# construct a test data with 3 chromosomes and 100 positions
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df  <- data.frame(chr, pos, cov)

require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)

ggplot2_geom_area_coverage_plot

edited Feb 01 '13 at 11:14

answered Jan 31 '13 at 17:38

Arun

116,683
26
284
387

Thank you for your suggestion. I tried your plot with data from only one of my chromosomes. I left it trying to produce the plot for about 25 minutes, but the process used up all the 32 GB RAM available so I was unable to do anything else, could not even kill the process. Eventually about an hour later, I was able to kill it. So in summary, that didn't work at all. And a warning to anyone else reading this, do NOT try this plot with coverage on basepair level, since it will be unable to handle so much data. – user1678000 Feb 01 '13 at 09:23
I think this is expected if you run the plot on ALL the positions of the chromosome. And what can you see after the plot by plotting all the positions? You should consider sampling from chromosomal positions and producing a plot that would be meaningful and interpretable. – Arun Feb 01 '13 at 10:22
Yes, a sliding window method would be good. Though I don't know how to do that. – user1678000 Feb 01 '13 at 11:16

Gx1sptDTDa · Answer 2 · 2013-02-02T11:37:56.840

1

You could use the ggplot2 package.

I'm not sure what exactly you want, but here's what I did: enter image description here This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).

library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)

And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.

The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same amount of data points as the original - it will only make the data smoother. This code, however, takes an average for every n points. It will of course run quite a bit faster, but you will loose a lot of detailed information.

VeryLongVector <- sample(500,1e+07,replace=TRUE)

movAv <- function(vector,n){
    chops <- as.integer(length(vector)/n)
    count <- 0
    pos <- 0
    Cov <-0
    pos[1:chops] <- 0
    Cov[1:chops] <- 0
    for(c in 1:chops){
        tmpcount <- count + n
        tmppos <- median(count:tmpcount)
        tmpCov <- mean(vector[count:tmpcount])
        pos[c] <- tmppos
        Cov[c] <- tmpCov
        count <- count + n
    }

    result <- data.frame(pos=pos,cov=Cov)
    return(result)
}

Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")

enter image description here

edited Feb 02 '13 at 11:37

answered Jan 31 '13 at 16:38

Gx1sptDTDa

1,558
2
11
23

I tried the ggplot advised above and had major trouble with R unable to handle and plot so many datapoints, even though I only used one of the chromosomes. So I'm a bit hesitant to try this one since I think the same problem will arise. :( – user1678000 Feb 01 '13 at 09:29
I've put the above code in a function, and ran system.time() on it: 1.766 seconds on my machine. But this is of course with just 7000 data points. Not the millions you have (do you REALLY need every single base position filled? That's about 3 **billion** positions in total for the human genome). – Gx1sptDTDa Feb 01 '13 at 13:44
I've tested with different sizes of the Chr1_cov vector a bit. Up 'till 100,000 points, it takes about 10 seconds. Extrapolating that model results in about a 1,000s runtime for 1e07 data points. That's about 20 minutes. IF you don't run out of memory, that is. – Gx1sptDTDa Feb 01 '13 at 14:02
One chromosome is about 30 million basepairs, and yes I have basepair resolution, but for plotting purposes it doesn't have to be basepair resolution! The best thing I guess would be some kind of sliding window that takes the average over 10-20 basepair or so, and then I can plot this value instead. Though I don't know how do do that. :/ – user1678000 Feb 01 '13 at 14:05
Thank you so much for your code and help. I have been working with this for several hours today. [This is how it currently looks](http://imgur.com/cyMqskc). And here is my code as to how I produced the plot (based on your function above): ggplot(data=Chr1, aes(x=pos,y=log(1+cov))) + geom_area() + theme_bw() + theme(panel.border=element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) – user1678000 Feb 02 '13 at 15:50
Now I'm struggling with figuring out how I can have several rows with several chromosomes of different length, similar to how it looked in my original picture, and additionally the dots for the genes under every chromosome. – user1678000 Feb 02 '13 at 15:50

Long vector-plot/Coverage plot in R

2 Answers2