41

I use matplotlib to plot a scatter chart:

enter image description here

I label each bubble with a transparent box, following the tip in "How to annotate point on a scatter automatically placed arrow".

Here is the code:

if show_annote:
    for i in range(len(x)):
        annote_text = annotes[i][0][0]  # STK_ID
        ax.annotate(annote_text, xy=(x[i], y[i]), xytext=(-10,3),
            textcoords='offset points', ha='center', va='bottom',
            bbox=dict(boxstyle='round,pad=0.2', fc='yellow', alpha=0.2),
            fontproperties=ANNOTE_FONT) 

and the resulting plot: enter image description here

But there is still room to reduce overlap (for instance, the label box offset is fixed at (-10, 3)). Are there algorithms that can:

  1. dynamically change the offset of the label box according to how crowded its neighbourhood is
  2. dynamically place the label box farther away and draw an arrow between the bubble and the label box
  3. change the label orientation somewhat
  4. prefer a label box overlapping a bubble over a label box overlapping another label box?

I just want to make the chart easy for human eyes to comprehend, so some overlap is OK; it is not as rigid a constraint as http://en.wikipedia.org/wiki/Automatic_label_placement suggests. The chart contains fewer than 150 bubbles most of the time.

I find the so-called force-based label placement (http://bl.ocks.org/MoritzStefaner/1377729) quite interesting. I don't know whether any Python code or package implements the algorithm.

I am not an academic and am not looking for an optimal solution. My Python code needs to label many charts, so speed and memory use matter.
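To make item 1 in the list above concrete, here is a rough sketch of one possible heuristic. It is an assumption for illustration, not an established algorithm: each label is pushed away from the mean position of the points near its own marker. The function name `crowding_offsets` and its `radius` and `length` parameters are hypothetical, and the sketch assumes x and y are on comparable scales.

```python
import numpy as np

def crowding_offsets(x, y, radius, length=15.0):
    """Return one (dx, dy) offset per marker, pointing away from the mean
    position of its neighbours within `radius` (in data units)."""
    pts = np.column_stack([x, y]).astype(float)
    offsets = np.zeros_like(pts)
    for i, p in enumerate(pts):
        dist = np.hypot(*(pts - p).T)
        # neighbours: other points closer than `radius` (excludes the point itself)
        near = (dist < radius) & (dist > 0)
        direction = np.array([0.0, 1.0])  # default: label goes straight up
        if near.any():
            away = p - pts[near].mean(axis=0)
            norm = np.hypot(*away)
            if norm > 0:
                direction = away / norm
        offsets[i] = direction * length
    return offsets

x = np.array([0.0, 1.0, 5.0])
y = np.array([0.0, 0.0, 5.0])
offsets = crowding_offsets(x, y, radius=2.0)
print(offsets)
```

Each row of `offsets` could then be passed as `xytext` to `ax.annotate(..., textcoords='offset points')` in place of the fixed (-10, 3). This only handles label-to-marker crowding; it does not check label boxes against each other.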

Trenton McKinney
bigbug

5 Answers

29

Another option using my library adjustText, written specially for this purpose (https://github.com/Phlya/adjustText).

import numpy as np
import matplotlib.pyplot as plt
from adjustText import adjust_text

np.random.seed(2016)

N = 50
scatter_data = np.random.rand(N, 3)
fig, ax = plt.subplots()
ax.scatter(scatter_data[:, 0], scatter_data[:, 1],
           c=scatter_data[:, 2], s=scatter_data[:, 2] * 150)
labels = ['ano_{}'.format(i) for i in range(N)]
texts = []
for x, y, text in zip(scatter_data[:, 0], scatter_data[:, 1], labels):
    texts.append(ax.text(x, y, text))
plt.show()

enter image description here

np.random.seed(2016)

N = 50
scatter_data = np.random.rand(N, 3)
fig, ax = plt.subplots()
ax.scatter(scatter_data[:, 0], scatter_data[:, 1],
           c=scatter_data[:, 2], s=scatter_data[:, 2] * 150)
labels = ['ano_{}'.format(i) for i in range(N)]
texts = []
for x, y, text in zip(scatter_data[:, 0], scatter_data[:, 1], labels):
    texts.append(ax.text(x, y, text))
adjust_text(texts, force_text=0.05, arrowprops=dict(arrowstyle="-|>",
                                                    color='r', alpha=0.5))
plt.show()

enter image description here

The labels are not repelled by the bubbles themselves, only by the bubble centers and by other labels.

Phlya
  • To whom it may concern: I had a plot with around 200 labels, and the default settings led to long render times. Set the parameter `lim=20` to iterate quickly (the default is 500). Supercool tool btw! Thanks very much for making this available. – petezurich Sep 30 '18 at 13:42
28

The following builds on tcaswell's answer.

NetworkX layout methods such as nx.spring_layout rescale the positions so that they all fit in a unit square (by default). Even the positions of the fixed data_nodes are rescaled. So, to apply pos to the original scatter_data, the shift and the scaling must be undone.

Note also that nx.spring_layout has a k parameter which controls the optimal distance between nodes. As k increases, so does the distance of the annotations from the data points.
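The unscaling step can be checked in isolation. The following is a minimal NumPy-only sketch: the `true_scale` and `true_shift` values are made up to stand in for whatever rescaling a layout routine applies, and a straight-line fit per axis with np.polyfit recovers the inverse mapping from the fixed points.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((5, 2)) * 10   # fixed data points, in data coordinates

# Hypothetical rescaling, standing in for what a layout routine might do:
# a uniform scale plus a per-axis shift.
true_scale = 0.07
true_shift = np.array([-0.3, 0.2])
rescaled = original * true_scale + true_shift

# Fit original = scale * rescaled + shift, separately for each axis.
scale_x, shift_x = np.polyfit(rescaled[:, 0], original[:, 0], 1)
scale_y, shift_y = np.polyfit(rescaled[:, 1], original[:, 1], 1)
scale = np.array([scale_x, scale_y])
shift = np.array([shift_x, shift_y])

# Applying the fitted mapping should bring the points back to data coordinates.
recovered = rescaled * scale + shift
print(np.allclose(recovered, original))
```

The same fitted `scale` and `shift` can then be applied to the annotation positions, which is what the function below does with the output of nx.spring_layout.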

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
np.random.seed(2016)

N = 20
scatter_data = np.random.rand(N, 3)*10


def repel_labels(ax, x, y, labels, k=0.01):
    G = nx.DiGraph()
    data_nodes = []
    init_pos = {}
    for xi, yi, label in zip(x, y, labels):
        data_str = 'data_{0}'.format(label)
        G.add_node(data_str)
        G.add_node(label)
        G.add_edge(label, data_str)
        data_nodes.append(data_str)
        init_pos[data_str] = (xi, yi)
        init_pos[label] = (xi, yi)

    pos = nx.spring_layout(G, pos=init_pos, fixed=data_nodes, k=k)

    # undo spring_layout's rescaling
    pos_after = np.vstack([pos[d] for d in data_nodes])
    pos_before = np.vstack([init_pos[d] for d in data_nodes])
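    # fit each axis separately so the x scale is not overwritten by the y scale
    scale_x, shift_x = np.polyfit(pos_after[:, 0], pos_before[:, 0], 1)
    scale_y, shift_y = np.polyfit(pos_after[:, 1], pos_before[:, 1], 1)
    scale = np.array([scale_x, scale_y])
    shift = np.array([shift_x, shift_y])
    for key, val in pos.items():
        pos[key] = (val * scale) + shift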

    for label, data_str in G.edges():
        ax.annotate(label,
                    xy=pos[data_str], xycoords='data',
                    xytext=pos[label], textcoords='data',
                    arrowprops=dict(arrowstyle="->",
                                    shrinkA=0, shrinkB=0,
                                    connectionstyle="arc3", 
                                    color='red'), )
    # expand limits
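    # np.vstack needs a sequence, not a dict view
    all_pos = np.vstack(list(pos.values()))
    span = np.ptp(all_pos, axis=0)  # (x_span, y_span)
    # pad each axis by 15% of its own span
    mins = all_pos.min(axis=0) - 0.15 * span
    maxs = all_pos.max(axis=0) + 0.15 * span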
    ax.set_xlim([mins[0], maxs[0]])
    ax.set_ylim([mins[1], maxs[1]])

fig, ax = plt.subplots()
ax.scatter(scatter_data[:, 0], scatter_data[:, 1],
           c=scatter_data[:, 2], s=scatter_data[:, 2] * 150)
labels = ['ano_{}'.format(i) for i in range(N)]
repel_labels(ax, scatter_data[:, 0], scatter_data[:, 1], labels, k=0.008)

plt.show()

With k=0.011 this yields

enter image description here

and with k=0.008 it yields

enter image description here

unutbu
  • I had to change pos.iteritems() in the for loop to pos.items(). I'm using Python 3.5.2 and networkx v1.11. – equant Aug 19 '16 at 21:47
  • I am getting `FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future. all_pos = np.vstack(pos.values())` Could you please advise how that can be resolved? – Slartibartfast Jul 05 '20 at 20:02
22

It is a little rough around the edges (I can't quite figure out how to scale the relative strengths of the spring network vs the repulsive force, and the bounding box is a bit screwed up), but this is a decent start:

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx

N = 15
scatter_data = np.random.rand(3, N)  # rows: x, y, and a value used for colour/size
G = nx.Graph()

data_nodes = []
init_pos = {}
for j, b in enumerate(scatter_data.T):
    x, y, _ = b
    data_str = 'data_{0}'.format(j)
    ano_str = 'ano_{0}'.format(j)
    G.add_node(data_str)
    G.add_node(ano_str)
    G.add_edge(data_str, ano_str)
    data_nodes.append(data_str)
    # start each annotation on top of its data point
    init_pos[data_str] = (x, y)
    init_pos[ano_str] = (x, y)

# keep the data nodes fixed; only the annotation nodes move
pos = nx.spring_layout(G, pos=init_pos, fixed=data_nodes)
ax = plt.gca()
ax.scatter(scatter_data[0], scatter_data[1], c=scatter_data[2], s=scatter_data[2]*150)

for j in range(N):
    data_str = 'data_{0}'.format(j)
    ano_str = 'ano_{0}'.format(j)
    ax.annotate(ano_str,
                xy=pos[data_str], xycoords='data',
                xytext=pos[ano_str], textcoords='data',
                arrowprops=dict(arrowstyle="->",
                                connectionstyle="arc3"))

# np.vstack needs a sequence, not a dict view
all_pos = np.vstack(list(pos.values()))
mins = np.min(all_pos, 0)
maxs = np.max(all_pos, 0)

ax.set_xlim([mins[0], maxs[0]])
ax.set_ylim([mins[1], maxs[1]])

plt.show()

sample image

How well it works depends a bit on how your data is clustered.

tacaswell
2

Just created another quick solution that is also very fast: textalloc

In this case you could do something like this:

import textalloc as ta
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2022)
N = 30
scatter_data = np.random.rand(N, 3)*10

fig, ax = plt.subplots()
ax.scatter(scatter_data[:, 0], scatter_data[:, 1], c=scatter_data[:, 2], s=scatter_data[:, 2] * 50, zorder=10, alpha=0.5)
text_list = ['ano-{}'.format(i) for i in range(N)]
ta.allocate_text(fig,ax,scatter_data[:, 0],scatter_data[:, 1],
            text_list,
            x_scatter=scatter_data[:, 0], y_scatter=scatter_data[:, 1],
            max_distance=0.2,
            min_distance=0.04,
            margin=0.039,
            linewidth=0.5,
            nbr_candidates=400)
plt.show()

scatterplot

ckjellson
0

Plotly is another option. It does not resolve overlapping labels automatically when there is a lot of data, but the interactive figure lets you zoom in and out to read them.

import plotly.express as px
df = px.data.gapminder().query("year==2007 and continent=='Americas'")

fig = px.scatter(df, x="gdpPercap", y="lifeExp", text="country", log_x=True, size_max=100, color="lifeExp",
                 title="Life Expectancy")
fig.update_traces(textposition='top center')

fig.show()

Output:

enter image description here

bigbounty