I have a scatter plot with a number of points. Each point has a string associated with it (varying in length) that I'd like to use as a label, but I can't fit them all. So I'd like to iterate through my data points from most to least important, and in each case apply a label only if it would not overlap an existing label. One of the commenters mentions solving a knapsack problem to find an optimal solution. In my case the greedy algorithm (always label the most important remaining point that can be labeled without overlap) would be a good start and might suffice.

Here's a toy example. Could I get Python to label only as many points as it can without overlapping?

import matplotlib.pyplot as plt
import numpy as np

npoints = 100
xs = np.random.rand(npoints)
ys = np.random.rand(npoints)

plt.scatter(xs, ys)

labels = iter(dir(np))
for x, y in zip(xs, ys):
    # Ideally I'd condition the next line on whether or not the new label would overlap with an existing one
    plt.annotate(next(labels), xy=(x, y))
plt.show()
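
The greedy rule described above can be sketched independently of matplotlib. This is only a minimal illustration, assuming each label's bounding box is already known as an `(x0, y0, x1, y1)` tuple; the names `overlaps` and `greedy_select`, the sample boxes, and the importance scores are all hypothetical.

```python
def overlaps(a, b):
    """Return True if axis-aligned boxes a and b, given as (x0, y0, x1, y1), intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def greedy_select(boxes, importance):
    """Pick box indices greedily, most important first, skipping any overlap."""
    order = sorted(range(len(boxes)), key=lambda i: importance[i], reverse=True)
    chosen = []
    for i in order:
        if not any(overlaps(boxes[i], boxes[j]) for j in chosen):
            chosen.append(i)
    return chosen


# Hypothetical label rectangles; box 1 overlaps both box 0 and box 2.
boxes = [(0, 0, 4, 1), (3, 0, 7, 1), (6, 0, 9, 1), (20, 5, 24, 6)]
importance = [0.9, 0.5, 0.7, 0.1]
print(greedy_select(boxes, importance))  # [0, 2, 3]
```

Box 1 is dropped because the more important box 0 has already claimed part of its area. The result is not guaranteed to be the largest possible set of labels, only the one the greedy order produces.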
kuzzooroo
  • http://stackoverflow.com/questions/14938541/how-to-improve-the-label-placement-for-matplotlib-scatter-chart-code-algorithm/15859652#15859652 – tacaswell Aug 26 '14 at 23:06
  • in short, no this isn't built in (and finding the optimal set to label strikes me as a variation of the knapsack problem anyway...). You could probably keep track of all the text you have added and then check if the bboxes overlap, however text objects don't know how big they are until they are drawn so this could get very expensive. – tacaswell Aug 26 '14 at 23:08
  • [Here](http://stackoverflow.com/a/4056853/3419103) you might find everything you need to implement such an automatic label placement. But it's not trivial. – Falko Aug 26 '14 at 23:30
  • @tcaswell, you say that having to draw the text boxes to find out how big they are would be "expensive." Did you mean in terms of computational time? I have code now that labels all the points, and it takes only a second or two to run. Even my trickiest use case has only a few thousand points. – kuzzooroo Aug 26 '14 at 23:46
  • Fair enough re computation time. Text rendering is one of the bottlenecks, but I was thinking in terms of animations/real-time plotting; in terms of human time it's still pretty fast. I should think a bit more before I type. – tacaswell Aug 26 '14 at 23:50

2 Answers


You can draw all the annotations first, then use a boolean mask array over the canvas pixels to check for overlap, and call set_visible(False) to hide any label that collides with one already placed. Here is an example:

import numpy as np
import pylab as pl
import random
import string
import math
random.seed(0)
np.random.seed(0)
n = 100
labels = ["".join(random.sample(string.ascii_letters, random.randint(4, 10))) for _ in range(n)]
x, y = np.random.randn(2, n)

fig, ax = pl.subplots()

ax.scatter(x, y)

ann = []
for i in range(n):
    ann.append(ax.annotate(labels[i], xy = (x[i], y[i])))

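# Boolean mask over canvas pixels; True marks pixels already covered by a label.
mask = np.zeros(fig.canvas.get_width_height(), bool)

# Text extents are only known once the figure has been drawn.
fig.canvas.draw()

for a in ann:
    bbox = a.get_window_extent()
    # Clamp to 0 so a label that extends past the left or bottom edge
    # does not produce negative (wrap-around) slice indices.
    x0 = max(int(bbox.x0), 0)
    x1 = max(int(math.ceil(bbox.x1)), 0)
    y0 = max(int(bbox.y0), 0)
    y1 = max(int(math.ceil(bbox.y1)), 0)

    s = np.s_[x0:x1+1, y0:y1+1]
    if np.any(mask[s]):
        a.set_visible(False)  # overlaps an earlier label, so hide it
    else:
        mask[s] = True  # claim these pixels for this label

pl.show()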

The output:

[image: the resulting scatter plot]
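
A variation on the same idea, offered only as a sketch: instead of a pixel mask, compare label bounding boxes directly with `Bbox.overlaps`, and visit the labels from most to least important as the question asks. The function name `label_greedily`, the random `importance` scores, and the generated label strings are hypothetical, and the Agg backend is an assumption made so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def label_greedily(fig, ax, xs, ys, labels, importance):
    """Annotate every point, then hide labels that overlap a more important one."""
    anns = [ax.annotate(s, xy=(x, y)) for s, x, y in zip(labels, xs, ys)]
    fig.canvas.draw()  # text extents are only known after a draw
    kept = []  # bounding boxes of the labels left visible
    for i in np.argsort(importance)[::-1]:  # most important first
        bbox = anns[i].get_window_extent()
        if any(bbox.overlaps(other) for other in kept):
            anns[i].set_visible(False)
        else:
            kept.append(bbox)
    return anns


rng = np.random.default_rng(0)
n = 100
xs, ys = rng.random(n), rng.random(n)
labels = ["point%d" % i for i in range(n)]
importance = rng.random(n)  # hypothetical importance scores

fig, ax = plt.subplots()
ax.scatter(xs, ys)
anns = label_greedily(fig, ax, xs, ys, labels, importance)
print(sum(a.get_visible() for a in anns), "of", n, "labels shown")
```

Comparing each new box against every kept box is quadratic in the number of visible labels, which should be acceptable for a few thousand points; the pixel mask avoids that cost at the price of a canvas-sized array.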

HYRY
  • This is great, thank you!! Note for other interested parties: this worked out of the box using IPython via Spyder, but to get it to work in PyCharm I had to add the line `pl.tight_layout()` above `fig.canvas.draw()`. I got the hint from [this answer](http://stackoverflow.com/a/18674635/2829764). – kuzzooroo Aug 27 '14 at 03:40

Just as an additional note: for my code to work, I had to pass an additional renderer=fig.canvas.get_renderer() argument to get_window_extent() rather than rely on the default get_window_extent(renderer=None). I think whether this argument is needed depends on the operating system. See https://github.com/matplotlib/matplotlib/issues/10874
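
For illustration, a minimal sketch of passing the renderer explicitly. It assumes an Agg-based canvas, since `get_renderer()` is provided by those canvases and may not exist on others; the annotation text is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: an Agg-based canvas, which provides get_renderer()
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ann = ax.annotate("example", xy=(0.5, 0.5))
fig.canvas.draw()

# Hand the renderer to get_window_extent() instead of relying on the default.
renderer = fig.canvas.get_renderer()
bbox = ann.get_window_extent(renderer=renderer)
print(bbox.width, bbox.height)  # label size in display pixels
```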