Rearranging labels of ggplot scatterplot with the directlabels library in R

Question

I am trying to arrange the labels of my ggplot scatterplot so that they don't overlap one another. For this purpose, I am trying to use the directlabels package, but I cannot get it to work. When I tried the code:

mytable <- read.csv('http://www.fileden.com/files/2012/12/10/3375236/My%20Documents/CF1_deNovoAssembly.csv', sep=",",  header=TRUE)

mytable$Consensus.length <- log(mytable$Consensus.length)

mytable$Average.coverage <-log(mytable$Average.coverage)

mytable$Name <- do.call(rbind,strsplit(as.character(mytable$Name), " ", '['))[,3]

p <- ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + geom_point() + ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + opts(title="Contig Coverage vs Length") + geom_text(hjust=0, vjust=-0.2, size=4)
direct.label(p, "first.qp")

I got this error:

Error in direct.label.ggplot(p, "first.qp") : 
  Need colour aesthetic to infer default direct labels.

So I changed the plotting script by adding a colour aesthetic to geom_point():

ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + geom_point(aes(colour=Average.coverage)) + ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + opts(title="Contig Coverage vs Length") + geom_text(hjust=0, vjust=-0.2, size=4)

And now I get the following error

Error in order.labels(d) : labels are not aligned

I found this thread, which suggests either placing the labels manually when there are only a few data points, or not labelling at all when there are too many. I agree with this, but I will be generating this graph for many different data sets and I do need the data labels. So far this is how the graph looks:

[image: current scatterplot]

Are the differences between the labels (e.g. 172 and 165) meaningful? I'm asking because you could use a colour scale based on a cut of these numbers, breaking them into groups of 10 or 20, for example. That would make sense if they represent a geography or something else that is a measurable distance. – Brandon Bertelsen, Dec 11 '12 at 04:11
Another step might be to remove the points and plot only the numbers (in which case you will want to set `hjust` and `vjust` to 0.5). But I think there is ultimately no way to have all of the labels present, non-overlapping, and at a large font size: too many of your data points are too close to one another. – Drew Steen, Dec 11 '12 at 04:45
@BrandonBertelsen the differences are not meaningful per se, but I would like to know where 172 and 165 cluster. For instance, I would like to identify which data points fall in the group between 4.5 and 5.5 on the y axis. – Julio Diaz, Dec 11 '12 at 23:40
@DrewSteen that is an interesting option. Could you please advise me on how to accomplish that? – Julio Diaz, Dec 11 '12 at 23:41

score 3 · Answer 1 · answered Dec 12 '12 at 02:59

You could simply remove the points and plot only the labels, which can be accomplished by commenting out the geom_point() part of your plot. (You'll want to change the hjust and vjust values to 0.5, also, so that the center of the label appears where the point would be):

ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + 
  #geom_point() +   # points removed; only the labels are drawn
  ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + 
  ggtitle("Contig Coverage vs Length") +   # opts(title=...) no longer exists in current ggplot2
  geom_text(hjust=0.5, vjust=0.5, size=4)

There's still some overlap, but perhaps by adjusting the size of the font and the plot it won't be too serious.
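As a rough sketch of that last adjustment, the label size can be reduced and the plot written to a larger canvas. The data frame `demo` below is simulated stand-in data (an assumption, since the original CSV is not used here), and the `size`, `width` and `height` values are arbitrary starting points rather than recommendations.

```r
library(ggplot2)

# Simulated stand-in for mytable (assumption: not the real contig data)
set.seed(1)
demo <- data.frame(
  Name = as.character(seq_len(60)),
  Consensus.length = log(runif(60, 200, 20000)),
  Average.coverage = log(runif(60, 5, 500))
)

p <- ggplot(demo, aes(x = Consensus.length, y = Average.coverage, label = Name)) +
  geom_text(hjust = 0.5, vjust = 0.5, size = 3) +  # smaller font than size = 4
  xlab("Contig Consensus Length (log)") +
  ylab("Contig Average Coverage (log)") +
  ggtitle("Contig Coverage vs Length")

# A larger canvas leaves more room between labels; width/height are in inches
out <- file.path(tempdir(), "contigs_labels_only.pdf")
ggsave(out, plot = p, width = 12, height = 9)
```

Shrinking the font and enlarging the canvas only reduces overlap; points that are nearly coincident will still collide.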


score 2 · Accepted Answer · Brandon Bertelsen · 2012-12-12T04:05:13.767

From your comments, it sounds a bit more like a clustering exercise. So, let's go ahead and actually do so:

set.seed(9234970)  # kmeans starts from random centers, so fix the seed
d <- data.frame(Name=mytable$Name, 
                x=mytable$Consensus.length, 
                y=mytable$Average.coverage)
# d[-1] drops the Name column, so clustering uses x and y only
d$kmeans <- as.factor(kmeans(d[-1],20)$cluster)
ggplot(d, aes(x, y, color=kmeans)) + 
  geom_point() + 
  theme(legend.position="bottom")

[image: kmeans clusters]

ggplot(d, aes(x, y, label=Name)) + 
  geom_text() + 
  facet_wrap(~kmeans, scales="free")

[image: Cluster Breakout]

I chose 20 clusters arbitrarily.
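If an arbitrary number of clusters feels unsatisfying, one common heuristic is to compare the total within-cluster sum of squares across several values of k and look for the point where the curve flattens. The sketch below is only an illustration on simulated data; `xy` is a hypothetical stand-in for the `x` and `y` columns of `d`, and the range `2:25` and `nstart = 10` are arbitrary choices.

```r
set.seed(9234970)

# Simulated stand-in for the x and y columns of d (assumption)
xy <- data.frame(x = rnorm(150, mean = 7, sd = 1),
                 y = rnorm(150, mean = 4, sd = 1))

ks <- 2:25

# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) {
  kmeans(xy, centers = k, nstart = 10)$tot.withinss
})

# Look for an "elbow" where adding clusters stops helping much
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The elbow is often ambiguous, so this is a guide for picking k rather than a rule.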

You could also use hierarchical clustering to see a dendrogram.

plot(hclust(dist(d[c("x", "y")])), labels = as.character(d$Name)) # cluster on the numeric x and y columns only
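To turn a dendrogram into group memberships that can be used like the kmeans column, the tree can be cut into a fixed number of groups with `cutree()`. This is a minimal sketch on simulated data: `d_demo` is a hypothetical stand-in for `d`, and `k = 5` is an arbitrary choice.

```r
set.seed(42)

# Simulated stand-in for d (assumption)
d_demo <- data.frame(Name = as.character(seq_len(40)),
                     x = rnorm(40, mean = 7, sd = 1),
                     y = rnorm(40, mean = 4, sd = 1))

# Hierarchical clustering on Euclidean distances (complete linkage by default)
hc <- hclust(dist(d_demo[c("x", "y")]))
plot(hc, labels = d_demo$Name)

# Cut the tree into 5 groups and store the membership as a factor
d_demo$hgroup <- as.factor(cutree(hc, k = 5))

# List the labels that fall in each group
groups <- split(d_demo$Name, d_demo$hgroup)
```

The resulting `hgroup` column could then be used for colouring or faceting in the same way as the `kmeans` column above.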

I'd recommend playing around with the cluster package in general as it may provide a more useful solution to your problem.

edited Dec 12 '12 at 04:05

answered Dec 12 '12 at 03:59

Brandon Bertelsen


Thanks, very interesting solution. I am guessing the clustering algorithm uses both the x and y values. Is there a way to cluster the data using only the y-axis values? – Julio Diaz Dec 13 '12 at 16:52
You would do the same thing but base your clusters on `as.factor(kmeans(d$y,20)$cluster)` – Brandon Bertelsen Dec 13 '12 at 18:00
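A small sketch of that y-only clustering, again on simulated data: `d_demo` is a hypothetical stand-in for `d`, and 6 clusters is an arbitrary choice. It also shows how to pull out the labels that fall in a given y band, such as the 4.5 to 5.5 range mentioned in the question comments.

```r
set.seed(9234970)

# Simulated stand-in for d (assumption)
d_demo <- data.frame(Name = as.character(seq_len(100)),
                     x = rnorm(100, mean = 7, sd = 1),
                     y = runif(100, min = 2, max = 8))

# Cluster on the y values only
km <- kmeans(d_demo$y, centers = 6, nstart = 10)
d_demo$ycluster <- as.factor(km$cluster)

# y range covered by each cluster (one row per cluster)
ranges <- do.call(rbind, tapply(d_demo$y, d_demo$ycluster, range))
colnames(ranges) <- c("y_min", "y_max")

# Labels whose y value lies between 4.5 and 5.5, regardless of cluster
in_band <- d_demo$Name[d_demo$y >= 4.5 & d_demo$y <= 5.5]
```

If the goal is only to list labels in a known y interval, the last line alone is enough and no clustering is needed.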
