Generate multiple serial graphs/scatterplots from data in two dataframes

Question

I have two data frames, Tg and Pf, each with 127 columns. Every column has at least one value and can have up to thousands. All values are between 0 and 1, and some are missing (empty cells). Here is a small subset:

Tg
Tg1 Tg2 Tg3 ... Tg127
0.9 0.5 0.4     0
0.9 0.3 0.6     0
0.4 0.6 0.6     0.3
0.1 0.7 0.6     0.4
0.1 0.8
0.3 0.9
    0.9
    0.6
    0.1

Pf
Pf1 Pf2 Pf3 ...Pf127
0.9 0.5 0.4    1
0.9 0.3 0.6    0.8 
0.6 0.6 0.6    0.7
0.4 0.7 0.6    0.5
0.1     0.6    0.5
0.3
0.3
0.3

Note that some cells are empty, and the two vectors for the same index (i.e. 1 to 127) can differ greatly in length and are rarely exactly the same length. I want to generate 127 graphs like the following, one per pair of vectors (i.e. graph 1 is for column 1 of each data frame, graph 2 is for column 2 of each data frame, etc.):

enter image description here

Hope that makes sense. I'm looking forward to your assistance as I don't want to make those graphs one by one... Thanks!

John Colby · Accepted Answer · 2011-11-09T00:25:30.233

4

Here is an example to get you started (data at https://gist.github.com/1349300). For further tweaking, check out the excellent ggplot2 documentation that is all over the web.

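library(ggplot2)
library(reshape2)  # melt(); current ggplot2 no longer attaches it
library(plyr)      # ddply(), summarize(), join(); used further below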

# Load data
Tg = read.table('Tg.txt', header=TRUE, fill=TRUE, sep=' ')
Pf = read.table('Pf.txt', header=TRUE, fill=TRUE, sep=' ')

# Format data
Tg$x        = as.numeric(rownames(Tg))
Tg          = melt(Tg, id.vars='x')
Tg$source   = 'Tg'
Tg$variable = factor(as.numeric(gsub('Tg(.+)', '\\1', Tg$variable)))

Pf$x        = as.numeric(rownames(Pf))
Pf          = melt(Pf, id.vars='x')
Pf$source   = 'Pf'
Pf$variable = factor(as.numeric(gsub('Pf(.+)', '\\1', Pf$variable)))

# Stack data
data = rbind(Tg, Pf)

# Plot
dev.new(width=5, height=4)
p = ggplot(data=data, aes(x=x)) + geom_line(aes(y=value, group=source, color=source)) + facet_wrap(~variable)
p

enter image description here
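To see what the "Format data" step above does to each table, here is a minimal base-R sketch of the same wide-to-long idea. It assumes a made-up two-column data frame called `wide` (not the gist data) and uses `stack()` in place of `melt()`, so treat it as an illustration rather than the code behind the plot.

```r
# Hypothetical toy data: two columns, the second padded with NA
wide <- data.frame(Tg1 = c(0.9, 0.4, 0.1),
                   Tg2 = c(0.5, 0.3, NA))

# stack() puts all values in one column and the column names in another
long <- stack(wide)
names(long) <- c("value", "variable")

# Row index within each original column, playing the role of Tg$x above
long$x <- rep(seq_len(nrow(wide)), times = ncol(wide))

# Keep only the number from the column name, as the gsub() call above does
long$variable <- factor(as.numeric(gsub("Tg(.+)", "\\1", long$variable)))
long$source <- "Tg"

print(long)
```

Each row of `long` is now one observation (x, value) tagged with the panel it belongs to (`variable`) and the table it came from (`source`), which is the shape that `facet_wrap(~variable)` and `color=source` rely on.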

Highlighting the area between the lines

First, interpolate the data onto a finer grid. This way the ribbon will follow the actual envelope of the lines, rather than just where the original data points were located.

data = ddply(data, c('variable', 'source'), function(x) data.frame(approx(x$x, x$value, xout=seq(min(x$x), max(x$x), length.out=100))))
names(data)[4] = 'value'
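As a small illustration of what `approx()` does inside that call, here is a sketch on a single made-up series (the vectors `x` and `value` are assumptions, not the gist data). It shows how trailing missing values are handled.

```r
# Hypothetical series: values at x = 1..4, with trailing NAs at x = 5 and 6
x <- 1:6
value <- c(0.9, 0.5, 0.1, 0.3, NA, NA)

# Interpolate onto 11 evenly spaced points spanning the full x range
fine <- data.frame(approx(x, value, xout = seq(min(x), max(x), length.out = 11)))
names(fine)[2] <- "value"

# approx() drops the NA points before interpolating, and with the default
# rule = 1 it returns NA for grid points beyond the last observed value (x > 4)
print(fine)
```

The NA rows at the end are the ones that later need removing if you want each panel's x-axis to stop where its data stops.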

Next, calculate the data needed for geom_ribbon - namely ymax and ymin.

ribbon.data = ddply(data, c('variable', 'x'), summarize, ymin=min(value), ymax=max(value))
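For a single panel, that `ddply()` summary amounts to taking the lower and upper of the two lines at every x. A rough base-R equivalent, assuming a made-up long data frame `d` with both sources sampled at the same x values:

```r
# Hypothetical long data for one panel: two sources sampled at the same x
d <- data.frame(x = rep(1:3, times = 2),
                source = rep(c("Tg", "Pf"), each = 3),
                value = c(0.9, 0.4, 0.1,   # Tg
                          0.2, 0.4, 0.6))  # Pf

# For each x, the ribbon runs from the lower line to the upper line
ymin <- tapply(d$value, d$x, min)
ymax <- tapply(d$value, d$x, max)
ribbon <- data.frame(x = as.numeric(names(ymin)),
                     ymin = as.vector(ymin),
                     ymax = as.vector(ymax))
print(ribbon)
```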

Now it is time to plot. Notice how we've added a new ribbon layer, for which we've substituted our new ribbon.data frame.

dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax),  alpha=0.3, data=ribbon.data)

enter image description here

Dynamic coloring between the lines

The trickiest variation is if you want the coloring to vary based on the data. For that, you currently must create a new grouping variable to identify the different segments. Here, for example, we might use a function that indicates when the "Tg" group is on top:

GetSegs <- function(x) {
  segs = x[x$source=='Tg', ]$value > x[x$source=='Pf', ]$value
  segs.rle = rle(segs)

  on.top = ifelse(segs, 'Tg', 'Pf')
  on.top[is.na(on.top)] = 'Tg'

  group = rep.int(1:length(segs.rle$lengths), times=segs.rle$lengths)
  group[is.na(segs)] = NA

  data.frame(x=unique(x$x), group, on.top)
}
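The grouping part of `GetSegs` may be easier to follow on a bare logical vector. This sketch uses a made-up `segs` vector (an assumption for illustration) and repeats the same `rle()` steps:

```r
# Hypothetical comparison result: is "Tg" above "Pf" at each x?
# The final NA stands for an x where one of the series is missing.
segs <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, NA)

# rle() finds runs of identical values; each run becomes one ribbon segment
segs.rle <- rle(segs)

# Number the runs 1, 2, 3, ... and repeat each number by its run length
group <- rep.int(seq_along(segs.rle$lengths), times = segs.rle$lengths)

# Points with no comparison get no group
group[is.na(segs)] <- NA

print(group)
```

Every time the lines cross, `segs` flips and a new group number starts, so the ribbon is drawn as separate pieces that can each take their own fill.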

Now we apply it and merge the results back with our original ribbon data.

groups = ddply(data, 'variable', GetSegs)
ribbon.data = join(ribbon.data, groups)

For the plot, the key is that we now specify a grouping aesthetic to the ribbon geom.

dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, group=group, fill=on.top),  alpha=0.3, data=ribbon.data)

enter image description here

Code is available together at: https://gist.github.com/1349300

edited Nov 09 '11 at 00:25

answered Nov 08 '11 at 21:26

John Colby


I heard that ggplot2 was powerful, but that is just flabbergasting! That is just exactly what I need. I will play around with the display layout and colors. Would it be possible to color the region between the 2 lines? – Olivier Nov 08 '11 at 21:36
great ggplot2 illustration. Olivier, note how @John Colby made his data "tall" by stacking the data. This step is the source of MUCH confusion/frustration when folks start using ggplot2. – JD Long Nov 08 '11 at 21:39
@Olivier Great, I'm glad it helped! ggplot2 is indeed very impressive, and this is a good example that it can capture the entire graph description in a one-liner. JD makes an *excellent* point too, that the key to happiness with ggplot2 is getting familiar with how to reshape your data. – John Colby Nov 08 '11 at 21:45
@Olivier It gets a bit trickier when you want to fill in between 2 lines that cross each other like this. I think currently you must make an additional grouping variable that identifies each segment. Here is one similar Q/A that is a good reference: http://stackoverflow.com/questions/7883154/how-do-i-fill-a-geom-area-plot-using-ggplot/7883556#7883556. I'll post relevant code for this example too when I have a moment. – John Colby Nov 08 '11 at 22:00
That is just beautiful. Period. I had not dared to ask for the color coding of the shade actually. What do you mean by making the data look "tall"? – Olivier Nov 09 '11 at 03:18
@JohnColby : I can get to plot them and was wondering if I can resize the x-axis for each graph so that each fill the whole axis for each panel? See [link](https://plus.google.com/u/0/photos/112804714833136314783/albums/5673003589672921297/5673003590651831314) – Olivier Nov 09 '11 at 14:35
Yep... Remove the NAs at the end with something like `data = data[!is.na(data$value), ]` and `ribbon.data = ribbon.data[!is.na(ribbon.data$ymax), ]`. Then add a `scales='free'` optional parameter to `facet_wrap()` in the first ggplot call and redo the plots. The x limits will now all be flush right. – John Colby Nov 09 '11 at 16:30
Haha...thanks, guys! I banged my head against the wall for a whole week on this last year when I had to color-code plots of t-stats vs. x, depending on whether their p-value was significant. Very happy to share so others don't have to do the same! – John Colby Nov 09 '11 at 18:27
@ John : I added data = data[!is.na(data$value), ] after dev.new() and ribbon.data = ribbon.data[!is.na(ribbon.data$ymax), ] after ribbon.data = join(ribbon.data, groups) and added the scales='free' in facet_wrap() in the first ggplot call. That solves the scaling issues but the ribbon has now a third category (NA) that it color codes purple. – Olivier Nov 09 '11 at 18:31
@Olivier Check out the new code in the gist link. That is what I'm using, and it's not giving me any NA levels. Does that work for you too? – John Colby Nov 09 '11 at 18:36
@ John : works great with simulated data but not with my real set. That's OK I can do without ribbons, it is just not as nice. I'll go back to it in a few days and revisit the issue. Thanks for the tremendous help! – Olivier Nov 10 '11 at 17:25

score 2 · Answer 2 · answered Nov 08 '11 at 22:23

Here is a three-liner to do the same :-). We first use reshape from base R to convert the data into long form. Then we melt it to suit ggplot2. Finally, we generate the plot!

mydf   <- reshape(cbind(Tg, Pf), varying = 1:8, direction = 'long', sep = "")
mydf_m <- melt(mydf, id.var = c(1, 4), variable = 'source') 
qplot(id, value, colour = source, data = mydf_m, geom = 'line') + 
  facet_wrap(~ time, ncol = 2)

NOTE. The reshape function in base R is extremely powerful, albeit very confusing to use. It is used to transform data between long and wide formats.
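Since `reshape` is the confusing part, here is a minimal sketch of that first step on a made-up data frame (`wide` and its values are assumptions for illustration). With `sep = ""`, each column name is split into a variable name and a time index.

```r
# Hypothetical wide data: two Tg and two Pf columns with the same number of rows
wide <- data.frame(Tg1 = c(0.9, 0.4), Tg2 = c(0.5, 0.3),
                   Pf1 = c(0.8, 0.6), Pf2 = c(0.5, 0.7))

# varying = 1:4 says that all four columns hold repeated measurements;
# sep = "" splits "Tg1" into the variable "Tg" and the time 1, and so on
long <- reshape(wide, varying = 1:4, direction = "long", sep = "")

# Result has one row per (id, time) pair, with Tg and Pf side by side
print(long)
```

Here `time` is the column number (1 to 127 in the real data) and `id` is the row number, which is why the plot facets on `time` and puts `id` on the x-axis. Note that `varying` has to cover every Tg and Pf column, so it would need adjusting for the full data set.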

Very minimalist code and it works too! I am now sold on learning ggplot2 ! — Olivier, Nov 09 '11 at 03:22

JD Long · Answer 3 · 2011-11-09T16:12:35.057

1

Kudos for automating something you used to do in Excel using R! That's exactly how I got started with R and a common path to R enlightenment :)

All you really need is a little looping. Here's an example, most of which is creating example data that represents your data structure:

## create some example data

Tg <- data.frame(Tg1 = rnorm(10))
for (i in 2:10) {
  vec <- rep(NA, 8)
  vec <- c(rnorm(sample(5:10,1)), vec)
  Tg[paste("Tg", i, sep="")] <- vec[1:10]

}

Pf <- data.frame(Pf1 = rnorm(10))
for (i in 2:10) {
  vec <- rep(NA, 8)
  vec <- c(rnorm(sample(5:10,1)), vec)
  Pf[paste("Pf", i, sep="")] <- vec[1:10]

}
## ok, sample data created

## now let's loop through all the columns
## if you don't know how many columns there are, you can
## use ncol(Tg) to find out

for (i in 1:10) {
  ## x-axis runs to the longer of the two columns, not counting NAs
  plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3), 
     xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
  ## ylim is set by plot(); lines() only adds to the existing axes
  lines(1:10, Pf[,i], type = "l", col="red", lwd=5)
  ## copy the on-screen plot to a PNG file, then close that file device
  dev.copy(png, paste('rplot', i, '.png', sep=""))
  dev.off()
}

This will result in 10 graphs in your working directory that look like the following:

enter image description here
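One thing to watch with the `xlim` in the loop above: `length(na.omit(v))` counts the non-missing values, which matches the position of the last value only when all the NAs are at the end. If your real columns can have gaps in the middle, something like the following sketch may be closer to what you want. The helper `last_valid_index` and the two vectors are made up for illustration.

```r
# Hypothetical helper: index of the last non-missing value in a column
last_valid_index <- function(v) {
  idx <- which(!is.na(v))
  if (length(idx) == 0) 0L else max(idx)
}

# Made-up columns: tg has a gap in the middle, pf has only trailing NAs
tg <- c(0.9, NA, 0.4, 0.1, NA, NA)
pf <- c(0.9, 0.6, NA, NA, NA, NA)

# Counting non-NA values (as in the loop above) vs. locating the last one
count_based <- max(length(na.omit(tg)), length(na.omit(pf)))
index_based <- max(last_valid_index(tg), last_valid_index(pf))

print(c(count_based = count_based, index_based = index_based))
```

When the NAs are only trailing padding, as in the example data generated above, the two approaches give the same limit, so the loop works as written.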

edited Nov 09 '11 at 16:12

answered Nov 08 '11 at 21:28

JD Long


Indeed, you are reading my mind! I can see that it needs to be scripted but I have no clue on how to do it. Playing around with your code until I get it will be of great instructional help. Thanks a bunch! I'll keep reading R books and play around. Glad to hear you started like that! If you have any generic pointers on what made you successful, I'm all ears :) – Olivier Nov 08 '11 at 21:35
tenacity. Everything gives up under unrelenting perspiration. Skill is over rated :) – JD Long Nov 08 '11 at 21:41
LOL. I was hoping for a silver bullet ;) but brute force often does it. – Olivier Nov 09 '11 at 03:23
Any suggestions on how to change the X axis so that its max is the length of the vector Tg or Pf, whichever is longer? Indeed, in my real dataset, x values go from 200 to 10,000, and plotting everything on a 10,000 axis compresses the shorter vectors. I'm playing with your code in the meantime. – Olivier Nov 09 '11 at 15:30
@ JD : I thought the following would work, but somehow it still plots to 10: for (i in 1:ncol(Tg)) { plot(1:length(Tg[,i]), Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3)) lines(1:length(Pf[,i]), Pf[,i], type = "l", col="red", lwd=5, ylim=c(-3,3)) dev.copy(png, paste('rplot', i, '.png', sep="")) dev.off() } – Olivier Nov 09 '11 at 15:47
@Olivier I updated the code above adding an `xlim` parameter. You want to set the xlim to the max of the lengths of each vector. But NA values have to be excluded or the length will include the NAs which you don't want. – JD Long Nov 09 '11 at 16:13
@ JD: How do you write code in the comments? I've tried the 4 spaces and looking it up but can't figure it out... Thanks! – Olivier Nov 09 '11 at 22:19
