only writing visible points to disk of an overplotted scatterplot

Question

I am creating matplotlib scatterplots of around 10000 points. At the point size I am using, this results in overplotting, i.e. some of the points will be hidden by the points that are plotted over them.

While I don't mind about the fact that I cannot see the hidden points, they are redundantly written out when I write the figure to disk as pdf (or other vector format), resulting in a large file.

Is there a way to create a vector image where only the visible points would be written to the file? This would be similar to the concept of "flattening" / merging layers in photo editing software. (I still like to retain the image as vector, as I would like to have the ability to zoom in).

Example plot:

import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
random.seed(15)

df = pd.DataFrame({'x': np.random.normal(10, 1.2, 10000), 
                   'y': np.random.normal(10, 1.2, 10000), 
                   'color' : np.random.normal(10, 1.2, 10000)})
df.plot(kind = "scatter", x = "x", y = "y", c = "color", s = 80, cmap = "RdBu_r")
plt.show()

When you save the figure you can reduce the quality of the image by the `dpi` keyword. Actually, this is not a direct answer to your question, but if the problem is the weight of your file, this could be an easy solution. — Alessandro Peca, Mar 23 '20 at 17:48
@AlessandroPeca but then it's not a vector graphic anymore, is it? — user8408080, Mar 23 '20 at 18:02
DPI is meaningless for vector output. Concerning the problem, it will be expensive to compute. You would convert your data to units of inches (or points), such that you can do arithmetics with the radius of sqrt(80)/2 in units of points. Then you need to find an inexpensive way to define "overlap" that would not require to check each combination of one point with every other point. — ImportanceOfBeingErnest, Mar 23 '20 at 18:05
@ImportanceOfBeingErnest It seems that that would be a good solution. Given that I am facing this problem quite often, I was wondering if this is already implemented somewhere? — MartijnVanAttekum, Mar 23 '20 at 19:00

JohanC · Answer 1 · 2020-03-23T21:20:45.367

1

My best guess would be to use a hexbin. Note that with a scatter plot, the dots that are plotted latest will be the only ones visible. With a hexbin, all coinciding dots will be averaged.

If interested, the centers of the hexagons can be used to again create a scatter plot only showing the minimum.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(15)
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 10000),
                   'y': np.random.normal(10, 1.2, 10000),
                   'color': np.random.normal(10, 1.2, 10000)})

fig, ax = plt.subplots(ncols=4, gridspec_kw={'width_ratios': [10,10,10,1]})

norm = plt.Normalize(df.color.min(), df.color.max())
df.plot(kind="scatter", x="x", y="y", c="color", s=10, cmap="RdBu_r", norm=norm, colorbar=False, ax=ax[0])

hexb = ax[1].hexbin(df.x, df.y, df.color, cmap="RdBu_r", norm=norm, gridsize=80)

centers = hexb.get_offsets()
values = hexb.get_array()
ax[2].scatter(centers[:,0], centers[:,1], c=values, s=10, cmap="RdBu_r", norm=norm)

plt.colorbar(hexb, cax=ax[3])
plt.show()

Here is another comparison. The number of dots is reduced with a factor of 10, and the plot is more "honest" as overlapping dots are averaged.

edited Mar 23 '20 at 21:20

answered Mar 23 '20 at 18:20

JohanC

71,591
8
33
66

Did this answer shed some light on your question? – JohanC Mar 24 '20 at 19:17
if you create a grid, you will either loose the intensity values (here represented by color) or an impression of the density of the points in the different regions. In this case, where you summarize by the intensity, you loose density. Therefore, the "removing hidden points" solution as ImportanceOfBeingErnest suggests would be better, but is not implemented anywhere. I am still thinking if there is a way to compute whether a circle is completely hidden. – MartijnVanAttekum Mar 25 '20 at 14:04
Yes, you loose density, as that is what you're asking: to draw less circles. But as those resulting circles slightly overlap, the backgroud stays covered at the same spots. Also note that scatter creates circles in screen space, while hexbin creates the hexagons in data space. The scater plot will look different if you zoom in. – JohanC Mar 25 '20 at 14:29
The problem is not so much that the total density will decrease, but that with the hexbin method, the _visible_ density will equalized over the whole plot, meaning that you loose the possibility to get an impression on which regions of the plot are denser than others. – MartijnVanAttekum Apr 02 '20 at 10:03
The other answer is a clear case of severe over-engineering. Lots of work to end up with about the same image size (in the case of 10000 dots). Is your real use case about colors that look random and you want to show chaotic colors to simulate density? You might want to draw a heatmap below or along your plot. – JohanC Apr 02 '20 at 10:26

score 1 · Accepted Answer · answered Mar 29 '20 at 16:06

tl;dr

I don't know of any simple solution such as

RemoveOccludedCircles(C)

The algorithm below requires some implementation, but it shouldn't be too bad.

Problem reformulation

While we could try to remove existing circles when adding new ones, I find it easier to think about the problem the other way round, processing all circles in reverse order and pretending to draw each new circle behind the existing ones.

The main problem then becomes: How can I efficiently determine whether one circle would be completely hidden by another set of circles?

Conditions

In the following, I will describe an algorithm for the case where the circles are sorted by size, such that larger circles are placed behind smaller circles. This includes the special case where all circles have same size. An extension to the general case would actually be significantly more complicated as one would have to maintain a triangulation of the intersection points. In addition, I will make the assumption that no two circles have the exact same properties (radius and position). These identical circles could easily be filtered.

Datastructures

C: A set of visible circles

P: A set of control points

Control points will be placed in such a way that no newly placed circle can become visible unless either its center lies outside the existing circles or at least one control point falls inside the new circle.

Problem visualisation

In order to better understand the role of control poins, their maintenance and the algorithm, have a look at the following drawing: Processing 6 circles

In the linked image, active control points are painted in red. Control points that are removed after each step are painted in green or blue, where blue points were created by computing intersections between circles.

In image g), the green area highlights the region in which the center of a circle of same size could be placed such that the corresponding circle would be occluded by the existing circles. This area was derived by placing circles on each control point and subtracting the resulting area from the area covered by all visible circles.

Control point maintenance

Whenever adding one circle to the canvas, we add four active points, which are placed on the border of the circle in an equidistant way. Why four? Because no circle of same or bigger size can be placed with its center inside the current circle without containing one of the four control points.

After placing one circle, the following assumption holds: A new circle is completely hidden by existing circles if

Its center falls into a visible circle.
No control point lies strictly inside the new circle.

In order to maintain this assumption while adding new circles, the set of control points needs to be updated after each addition of a visible circle:

Add 4 new control points for the new circle, as described before.
Add new control points at each intersection of the new circle with existing visible circles.
Remove all control points that lie strictly inside any visible circle.

This rule will maintain control points at the outer border of the visible circles in such a dense way that no new visible circle intersecting the existing circles can be placed without 'eating' at least one control point.

Pseudo-Code

AllCircles <- All circles, sorted from front to back
C <- {} // the set of visible circles
P <- {} // the set of control points
for X in AllCircles {
  if (Inside(center(X), C) AND Outside(P, X)) {
    // ignore circle, it is occluded!
  } else {
    C <- C + X
    P <- P + CreateFourControlPoints(X)
    P <- P + AllCuttingPoints(X, C)
    RemoveHiddenControlPoints(P, C)
  }
}
DrawCirclesInReverseOrder(C)

The functions 'Inside' and 'Outside' are a bit abstract here, as 'Inside' returns true if a point is contained in one or more circles from a seto circles and 'Outside' returns true if all points from a set of points lie outside of a circle. But none of the functions used should be hard to write out.

Minor problems to be solved

How to determine in a numerically stable way whether a point is strictly inside a circle? -> This shouldn't be too bad to solve as all points are never more complicated than the solution of a quadratic equation. It is important, though, to not rely solely on floating point representations as these will be numerically insufficient and some control points would probable get completely lost, effectively leaving holes in the final plot. So keep a symbolic and precise representation of the control point coordinates. I would try SymPy to tackle this problem as it seems to cover all the required math. The formula for intersecting circles can easily be found online, for example here.
How to efficiently determine whether a circle contains any control point or any visible circle contains the center of a new circle? -> In order to solve this, I would propose to keep all elements of P and C in grid-like structures, where the width and height of each grid element equals the radius of the circles. On average, the number of active points and visible circles per grid cell should be in O(1), although it is possible to contruct artificial setups with arbitrary amounts of elements per grid cell, which would turn the overall algorithm from O(N) to O(N * N).

Runtime thoughts

As mentioned above, I would expect the runtime to scale linearly with the number of circles on average, because the number of visible circles in each grid cell will be in O(N) unless constructed in an evil way.

The data structures should be easily maintainable in memory if the circle radius isn't excessively small and computing intersections between circles should also be quite fast. I'm curious about final computational time, but I don't expect that it would be much worse than drawing all circles in a naive way a single time.