
I would like to make a relatively simple plot (reminiscent of timelines such as this: http://www.ats.ucla.edu/stat/sas/code/timeline.gif), but instead of time on the x-axis, it will show base positions in a genome. The "time spans" will be the coverage ranges of DNA-sequence scaffolds, showing where they fall in the genome, where they overlap, and which regions have no coverage. Here is a crude mock-up of what I am looking for, showing contig coverage of rRNAs: https://i.stack.imgur.com/lM1EP.png (I left out, but need, an x-axis showing the start and stop positions, and labels for the contigs, i.e. the colored lines). The coordinates are:

Contig# Start1  Stop1   Start2  Stop2   Start3  Stop3   Start4  Stop4
1   1   90  90  100 120 150 200 400
2   1   100 120 150 200 400 NA  NA
3   1   30  90  100 120 135 200 400
4   1   100 120 140 200 400 NA  NA
5   -35 80  90  100 130 150 200 400
6   1   100 200 300 360 400 NA  NA

I am pretty sure this can be done in R, probably with ggplot2, but for some reason I cannot figure it out.

user1669785
  • What's your data look like? What code have you tried so far? – Matt Parker Sep 13 '12 at 21:01
  • Does not look at all like what I call a waterfall plot. – IRTFM Sep 13 '12 at 21:56
  • I have created such plots in `R` (see my questions on the subject [here](http://stackoverflow.com/questions/9607527/r-using-the-segments-function-to-plot-a-map-of-stacked-lines) and [here](http://stackoverflow.com/questions/9871043/increasing-the-performance-of-visualising-overlapping-segments)), and yes, it is possible in both `ggplot` and base graphics. But to help you further, we need you to provide a sample of your input. – MattLBeck Sep 14 '12 at 14:25
  • @Matt Parker - I edited the post to add an example of the data as mock coordinates above. I was actually trying to do this with gantt.chart from the plotrix package, which produces a figure much like the one I want, but time on the x-axis is hard-coded, so it doesn't work for me. I haven't used ggplot before; from what I've read it seems like the way to go, but it has a steep learning curve. – user1669785 Sep 14 '12 at 14:29
  • @DWin What I was visualizing looks much like a horizontal waterfall plot with thinner than average bars. I thought a modified code for one might work for this. – user1669785 Sep 14 '12 at 14:43

2 Answers


This is not going to be as organized as your plot but it puts the lines in with coordinates that you have yet to provide:

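# Random example data: 20 segments spread over 6 groups
dfr <- data.frame(seg=sample(1:6, 20, replace=TRUE), start=sample(1:100, 20, replace=TRUE), end=sample(1:100, 20, replace=TRUE))
# Empty plot, then one horizontal segment per row, colored by group
plot(c(1,100), c(1,6), type="n")
with(dfr, segments(y0=seg, y1=seg, x0=start, x1=end, col=seg+1, lwd=3))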

With new dataset:

 Contig <- read.table(text=" Start1  Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
 1   1   90  90  100 120 150 200 400
 2   1   100 120 150 200 400 NA  NA
 3   1   30  90  100 120 135 200 400
 4   1   100 120 140 200 400 NA  NA
 5   -35 80  90  100 130 150 200 400
 6   1   100 200 300 360 400 NA  NA")
 # the reshape function can be tricky.... but seems to finally work.
 reshape(Contig, direction="long", sep="",
     varying=list(Start=names(Contig)[c(1,3,5,7)],
                   Stop=names(Contig)[c(2,4,6,8)] )  )
#------------------------------
    time Start1 Stop1 id
1.1    1      1    90  1
2.1    1      1   100  2
3.1    1      1    30  3
4.1    1      1   100  4
5.1    1    -35    80  5
6.1    1      1   100  6
1.2    2     90   100  1
2.2    2    120   150  2
3.2    2     90   100  3
4.2    2    120   140  4
5.2    2     90   100  5
6.2    2    200   300  6
1.3    3    120   150  1
2.3    3    200   400  2
3.3    3    120   135  3
4.3    3    200   400  4
5.3    3    130   150  5
6.3    3    360   400  6
1.4    4    200   400  1
2.4    4     NA    NA  2
3.4    4    200   400  3
4.4    4     NA    NA  4
5.4    4    200   400  5
6.4    4     NA    NA  6
#-----------------
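
For readers who find `reshape` opaque, the same wide-to-long step can be sketched by hand in base R. This is only an illustration, not the method used in this answer; the names `wide`, `starts`, `stops`, and `long` are made up here, and the data are the coordinates from the question with an explicit `id` column.

```r
# Hypothetical names throughout; data copied from the question
wide <- read.table(header = TRUE, text = "id Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1    1  90  90 100 120 150 200 400
2    1 100 120 150 200 400  NA  NA
3    1  30  90 100 120 135 200 400
4    1 100 120 140 200 400  NA  NA
5  -35  80  90 100 130 150 200 400
6    1 100 200 300 360 400  NA  NA")

# Pick the Start* and Stop* columns by name
starts <- wide[, grep("^Start", names(wide))]
stops  <- wide[, grep("^Stop",  names(wide))]

# Stack them column by column: one row per (contig, span) pair
long <- data.frame(
  id    = rep(wide$id, times = ncol(starts)),
  span  = rep(seq_len(ncol(starts)), each = nrow(wide)),
  start = unlist(starts, use.names = FALSE),
  stop  = unlist(stops,  use.names = FALSE)
)

# Drop the spans that do not exist (the NA padding)
long <- long[!is.na(long$start), ]
```

Each row of `long` can then be passed straight to a vectorized function such as `segments`, which is the reason for going long in the first place.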

LContig <- reshape(Contig, direction="long", sep="",
   varying=list(Start=names(Contig)[c(1,3,5,7)], Stop=names(Contig)[c(2,4,6,8)] )  )
# Take the x-range from the long data so that every span fits (-35 to 400)
plot(range(c(LContig$Start1, LContig$Stop1), na.rm=TRUE), c(1,6),
           type="n", xlab="Segments", ylab="Groups")
with(LContig, segments(y0=id, y1=id, x0=Start1, x1=Stop1, col=2:7, lwd=3))
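
The question also asks for an x-axis marking the start and stop positions and for labels on the contigs. One possible way to do that in base graphics is sketched below; it is an assumption about what the asker wants, not part of the original answer, and the names `spans` and `ticks` are hypothetical. `axis()` may silently skip tick labels that would overlap.

```r
# Data from the question, with an explicit id column
Contig <- read.table(header = TRUE, text = "id Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1    1  90  90 100 120 150 200 400
2    1 100 120 150 200 400  NA  NA
3    1  30  90 100 120 135 200 400
4    1 100 120 140 200 400  NA  NA
5  -35  80  90 100 130 150 200 400
6    1 100 200 300 360 400  NA  NA")

# Long form: one row per span, NA padding dropped
spans <- data.frame(
  id    = rep(Contig$id, times = 4),
  start = unlist(Contig[, c("Start1", "Start2", "Start3", "Start4")], use.names = FALSE),
  stop  = unlist(Contig[, c("Stop1", "Stop2", "Stop3", "Stop4")], use.names = FALSE)
)
spans <- spans[!is.na(spans$start), ]

# Tick positions: every distinct start or stop coordinate
ticks <- sort(unique(c(spans$start, spans$stop)))

# Widen the left margin for the contig labels, then draw an empty plot
# without the default axes and add the segments
par(mar = c(5, 6, 2, 1))
plot(range(ticks), c(0.5, 6.5), type = "n", axes = FALSE,
     xlab = "Position", ylab = "")
segments(x0 = spans$start, x1 = spans$stop, y0 = spans$id, y1 = spans$id,
         col = spans$id + 1, lwd = 3)

# x-axis at the start/stop positions (las = 2 turns the labels vertical);
# y-axis labeled with the contig numbers
axis(1, at = ticks, las = 2, cex.axis = 0.7)
axis(2, at = 1:6, labels = paste("Contig", 1:6), las = 1, tick = FALSE)
box()
```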


IRTFM
  • Thanks, it's close, but you're right - without sample data it's hard to tell if it will work for me. I have updated my post accordingly. – user1669785 Sep 14 '12 at 14:35
  • I very much appreciate the work here, I will let you know how it works out. Maybe back with another question or two as the code is not intuitive to me, but will try to muscle through. Thanks again. – user1669785 Sep 14 '12 at 16:32
  • Maybe the perspective of needing to recast data into a "long" form so that it will play nicely with vectorized functions like `segments` has not taken hold of your left-brain. I assure you it is a necessary mental module that you will need as an R-programmer. (And I did notice that my second effort was improperly truncated at 100 on the right-side.) Will fix. – IRTFM Sep 15 '12 at 00:23

Here's a version using ggplot2:

# Never forget
options(stringsAsFactors = FALSE)

# Load ggplot2 and reshape2
library(ggplot2)
library(reshape2)


# Read in the data
contig <- read.table(
    text = "id Start1  Stop1   Start2  Stop2   Start3  Stop3   Start4  Stop4
            1  1       90      90      100     120     150     200     400
            2  1       100     120     150     200     400     NA      NA
            3  1       30      90      100     120     135     200     400
            4  1       100     120     140     200     400     NA      NA
            5  -35     80      90      100     130     150     200     400
            6  1       100     200     300     360     400     NA      NA",
    header = TRUE
)


# Reshape it
# Melt it all the way down - each data value gets a record
# identified by id and variable name
contig.melt <- melt(contig, id.vars = "id")

# Your variable names contain two pieces of information:
# whether this point is a start or a stop, and
# which span this point is associated with.
# Much easier to work with those separately, so I'll parse them
# into variables.

# Which span?
contig.melt$span <- gsub(x = contig.melt$variable, 
                         pattern = ".*(\\d)",
                         replacement = "\\1")

# Start or stop?
contig.melt$point <- gsub(x = contig.melt$variable, 
                          pattern = "(.*)\\d",
                          replacement = "\\1")

# Cast it back into a dataset with a record for each span
contig.long <- dcast(contig.melt, id + span ~ point, value.var = "value")


# Plot it. The vertical position and line colors are determined by
# the ID. I'm calling that x here, but I'll flip the coords later
ggplot(contig.long, aes(x = id, color = factor(id))) +

    # geom_linerange plots a line from Start (ymin) to Stop (ymax)
    # Control the thickness of the lines with size
    geom_linerange(aes(ymin = Start, ymax = Stop), size = 2) +

    # Flip the coordinates
    coord_flip() +

    # Make it pretty
    scale_colour_brewer("RNA ID", palette = "Dark2") +
    labs(x = "RNA ID", y = "Position") +
    theme_bw()

[Image: ggplot2 plot]
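
The question also mentions needing an x-axis at the start and stop positions and labels for the contigs. A possible ggplot2 variation is sketched below under a few assumptions: the data are typed directly in long form, the names `spans`, `ticks`, and `p` are hypothetical, and `geom_segment` is swapped in for `geom_linerange` so that no coordinate flip is needed. The `linewidth` argument assumes ggplot2 3.4 or later; older versions use `size` instead.

```r
library(ggplot2)

# Question data typed directly in long form (hypothetical names)
spans <- data.frame(
  id    = rep(1:6, times = c(4, 3, 4, 3, 4, 3)),
  Start = c(1, 90, 120, 200,   1, 120, 200,   1, 90, 120, 200,
            1, 120, 200,   -35, 90, 130, 200,   1, 200, 360),
  Stop  = c(90, 100, 150, 400,   100, 150, 400,   30, 100, 135, 400,
            100, 140, 400,   80, 100, 150, 400,   100, 300, 400)
)

# Axis breaks: every distinct start or stop coordinate
ticks <- sort(unique(c(spans$Start, spans$Stop)))

p <- ggplot(spans, aes(y = factor(id), colour = factor(id))) +
  # One horizontal segment per span (linewidth assumes ggplot2 >= 3.4)
  geom_segment(aes(x = Start, xend = Stop, yend = factor(id)), linewidth = 2) +
  # Ticks at the start/stop positions, contig names on the y-axis
  scale_x_continuous(breaks = ticks) +
  scale_y_discrete(labels = function(i) paste("Contig", i)) +
  labs(x = "Position", y = NULL) +
  theme_bw() +
  # Rotate the crowded tick labels; the y-axis labels replace the legend
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        legend.position = "none")
print(p)
```

With ticks this dense some labels will sit close together, so thinning `ticks` by hand may be preferable for real genome coordinates.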

Matt Parker