I am trying to organize the labels of my ggplot scatterplot so that the labels don't overlap with one another. For this purpose, I am trying to use the direct labels library but I cannot get it to work. When I tried the code:

mytable <- read.csv('http://www.fileden.com/files/2012/12/10/3375236/My%20Documents/CF1_deNovoAssembly.csv', sep=",",  header=TRUE)

mytable$Consensus.length <- log(mytable$Consensus.length)

mytable$Average.coverage <-log(mytable$Average.coverage)

mytable$Name <- sapply(strsplit(as.character(mytable$Name), " "), "[", 3)  # keep the third space-separated token

library(ggplot2)
library(directlabels)

# store the plot in p so it can be passed to direct.label()
p <- ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) +
  geom_point() +
  ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") +
  ggtitle("Contig Coverage vs Length") +  # opts(title=...) in older ggplot2
  geom_text(hjust=0, vjust=-0.2, size=4)
direct.label(p, "first.qp")

I got this error:

Error in direct.label.ggplot(p, "first.qp") : 
  Need colour aesthetic to infer default direct labels.

So I changed the plotting script by adding a colour aesthetic to geom_point():

p <- ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) +
  geom_point(aes(colour=Average.coverage)) +
  ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") +
  ggtitle("Contig Coverage vs Length") +  # opts(title=...) in older ggplot2
  geom_text(hjust=0, vjust=-0.2, size=4)
direct.label(p, "first.qp")

And now I get the following error:

Error in order.labels(d) : labels are not aligned

I found this thread, which suggests either placing the labels manually when there are only a few data points or omitting them when there are too many. I agree with this, but I will be generating this graph with many different data sets and I do need the data labels. So far, this is how the graph looks:

[image]

Julio Diaz
  • Are the differences between each label (172 and 165) meaningful? I'm asking because you could use a colour scale based on a cut of these numbers, breaking them into groups of 10 or 20, for example, if they represent a geography or something else that is a measurable distance. – Brandon Bertelsen Dec 11 '12 at 04:11
  • Another step might be to remove the points and plot only the numbers (in which case you will want to set `hjust` and `vjust` to 0.5). But I think there is ultimately no way to have all of the labels present, non-overlapping, and at a large font size: too many of your data points are too close to one another. – Drew Steen Dec 11 '12 at 04:45
  • @BrandonBertelsen the differences are not meaningful per se, but I would like to know where 172 and 165 cluster. For instance, I would like to identify which data points cluster in the group of data points between 4.5 and 5.5 in the y axis. – Julio Diaz Dec 11 '12 at 23:40
  • @DrewSteen that is an interesting option. Could you please advise me on how to accomplish that? – Julio Diaz Dec 11 '12 at 23:41
  • I am encountering an identical problem. – MartinT Jul 03 '15 at 13:32
  • It is not clear to me what the question is. – randy Jun 14 '23 at 00:08

2 Answers

You could simply remove the points and plot only the labels, which can be accomplished by commenting out the geom_point() part of your plot. (You'll want to change the hjust and vjust values to 0.5, also, so that the center of the label appears where the point would be):

ggplot(mytable, aes(x=Consensus.length, y=Average.coverage, label=Name)) + 
  #geom_point() + 
  ylab("Contig Average Coverage (log)") + xlab("Contig Consensus Length (log)") + 
  ggtitle("Contig Coverage vs Length") +  # opts(title=...) in older ggplot2
  geom_text(hjust=0.5, vjust=0.5, size=4)

There's still some overlap, but perhaps by adjusting the size of the font and the plot it won't be too serious.
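
If dropping some labels is acceptable, a minimal sketch of another option is geom_text(check_overlap = TRUE), available in ggplot2 2.0.0 and later. The toy data frame below is made up for illustration and only stands in for mytable; the jitter-free positions and the vjust value are arbitrary choices, not part of the original answer.

```r
library(ggplot2)

# Made-up stand-in for mytable (the real CSV is not reproduced here)
set.seed(1)
toy <- data.frame(
  Name = as.character(1:60),
  Consensus.length = rnorm(60, mean = 7, sd = 1),
  Average.coverage = rnorm(60, mean = 5, sd = 0.5)
)

p <- ggplot(toy, aes(x = Consensus.length, y = Average.coverage, label = Name)) +
  geom_point(colour = "grey60") +
  # check_overlap = TRUE skips any label that would overlap one already drawn
  geom_text(check_overlap = TRUE, vjust = -0.4, size = 4) +
  xlab("Contig Consensus Length (log)") +
  ylab("Contig Average Coverage (log)") +
  ggtitle("Contig Coverage vs Length")

print(p)
```

This trades completeness for legibility: every point is still drawn, but only the labels that fit are shown. If every label must appear, the ggrepel package's geom_text_repel() may be worth trying instead, assuming it is installed.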

Drew Steen

From your comments, it sounds a bit more like a clustering exercise. So, let's go ahead and actually do so:

set.seed(9234970)
d <- data.frame(Name=mytable$Name, 
                x=mytable$Consensus.length, 
                y=mytable$Average.coverage)
# cluster on the numeric columns only (d[-1] drops Name)
d$kmeans <- as.factor(kmeans(d[-1],20)$cluster)
ggplot(d, aes(x, y, color=kmeans)) + 
  geom_point() + 
  theme(legend.position="bottom")

[Image: kmeans clusters]

ggplot(d, aes(x, y, label=Name)) + 
  geom_text() + 
  facet_wrap(~kmeans, scales="free")

[Image: cluster breakout]

I chose 20 clusters arbitrarily.
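
One possible way to make that choice less arbitrary is to compare the total within-cluster sum of squares across several values of k and look for the point where the curve flattens. The sketch below uses only base R; the toy data, the range 1:10, and nstart = 20 are assumptions for illustration, not values from the question.

```r
# Made-up stand-in for the x and y columns of d
set.seed(42)
toy <- data.frame(
  x = c(rnorm(30, 0), rnorm(30, 5), rnorm(30, 10)),
  y = c(rnorm(30, 0), rnorm(30, 5), rnorm(30, 0))
)

ks <- 1:10
# Total within-cluster sum of squares for each candidate k
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The bend in this curve is only a rough guide; for labelling purposes a larger k than the bend suggests may still be preferable, since smaller clusters leave fewer labels per facet.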

You could also use hierarchical clustering to see a dendrogram.

plot(hclust(dist(d[c("x", "y")]))) # use only the numeric x and y columns (drops Name and kmeans)

I'd recommend playing around with the cluster package in general as it may provide a more useful solution to your problem.
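
As a rough sketch of how the dendrogram could be turned into groups, base R's cutree() cuts the tree into a chosen number of clusters. The toy data and k = 5 below are made up for illustration; hclust() is left at its default complete linkage.

```r
# Made-up stand-in for d
set.seed(7)
toy <- data.frame(
  Name = as.character(1:40),
  x = rnorm(40, mean = 7),
  y = rnorm(40, mean = 5, sd = 0.5)
)

# Hierarchical clustering on the numeric columns only
hc <- hclust(dist(toy[, c("x", "y")]))
plot(hc, labels = toy$Name)   # leaves are labelled with Name

# Cut the tree into 5 groups and store the assignment as a factor
toy$group <- factor(cutree(hc, k = 5))
table(toy$group)
```

The resulting group column can be used in facet_wrap() in the same way as the kmeans column. To cluster on the y values alone, pass toy["y"] to dist() instead of both columns.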

Brandon Bertelsen
  • Thanks, very interesting solution. I am guessing the clustering algorithm uses the values on both the x and y axes. Is there a way to cluster the data using only the y-axis values? – Julio Diaz Dec 13 '12 at 16:52
  • You would do the same thing but base your clusters on `as.factor(kmeans(d$y,20)$cluster)` – Brandon Bertelsen Dec 13 '12 at 18:00