1

I'm plotting a simple 2D density map obtained with scipy.stats.gaussian_kde. There is always a plotting artifact towards the edges where the density appears to be lower:

enter image description here

I've tried every interpolation method in imshow() and none seems to be able to get rid of it. Is there a proper way to handle this?

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x_data = np.random.uniform(1., 2000., 1000)
y_data = np.random.uniform(1., 2000., 1000)
xmin, xmax = np.min(x_data), np.max(x_data)
ymin, ymax = np.min(y_data), np.max(y_data)
values = np.vstack([x_data, y_data])

# Gaussian KDE.
kernel = stats.gaussian_kde(values, bw_method=.2)
# Grid density (number of points).
gd_c = complex(0, 50)
# Define x,y grid.
x_grid, y_grid = np.mgrid[xmin:xmax:gd_c, ymin:ymax:gd_c]
positions = np.vstack([x_grid.ravel(), y_grid.ravel()])
# Evaluate kernel in grid positions.
k_pos = kernel(positions)

ext_range = [xmin, xmax, ymin, ymax]
kde = np.reshape(k_pos.T, x_grid.shape)
im = plt.imshow(np.rot90(kde), cmap=plt.get_cmap('RdYlBu_r'), extent=ext_range)

plt.show()
Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • I don't think it's clear what exactly is wrong with the output. – ImportanceOfBeingErnest Jan 12 '20 at 15:38
  • Nothing's really "wrong" with the output. But the density of the x,y points is not really lower towards the edges, that's an artifact of the plotting method. I'm asking if there's a way to "fix" this. – Gabriel Jan 12 '20 at 15:43
  • The edge of your plot is rather bluish, so the density *is* lower. – ImportanceOfBeingErnest Jan 12 '20 at 15:57
  • It's just a random uniform distribution of points in (x,y), it's not lower. It *looks* lower because the KDE sums Gaussian contributions and points close to the edges have no contribution from beyond the edges. – Gabriel Jan 12 '20 at 15:58
  • It's a KDE, so naturally there will be lower density around the edges since no gaussians can be placed e.g. lower that the lowest sample, so you don't get a benefit from neighboring gaussians. You can change the bandwidth to lessen this effect, but that will also make the density in the middle more spiky. You could make a bit of a hack and add some points with a low weight outside the range. You could treat the location and weight of these points as a separate optimization problem. – user2653663 Jan 12 '20 at 16:04
  • So the question is, why a KDE fall off at the edges? That's simply a mathematical consequence of the infinite support. – ImportanceOfBeingErnest Jan 12 '20 at 16:06
  • Yes user26.. that's exactly right. I was just wondering if there was a way to compensate for this effect at the plotting level. – Gabriel Jan 12 '20 at 16:09
  • No, there isn't. Matplotlib will always plot the data you give to it. – ImportanceOfBeingErnest Jan 12 '20 at 16:11
  • Related (but not quite duplicate): https://stackoverflow.com/questions/59649413/violin-plot-for-positive-values-with-python – Diziet Asahi Jan 12 '20 at 20:43

1 Answers1

3

After a while I found a way to address this issue applying a neat trick explained by Flabetvibes in this excellent answer.

I use the code shown there to mirror the data as shown in the first figure of the mentioned answer. The only modification I introduced is to trim the mirrored data to a perc padding (I set it to 10% by default) so as not to carry around a lot of unnecessary values.

The result is shown here, original non-mirrored data to the left, and mirrored data to the right:

enter image description here

As can be seen, the changes in the resulting density map are not trivial. I personally believe the mirrored-data KDE represents the actual density better.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def in_box(towers, bounding_box):
    return np.logical_and(np.logical_and(bounding_box[0] <= towers[:, 0],
                                         towers[:, 0] <= bounding_box[1]),
                          np.logical_and(bounding_box[2] <= towers[:, 1],
                                         towers[:, 1] <= bounding_box[3]))


def dataMirror(towers, bounding_box, perc=.1):
    # Select towers inside the bounding box
    i = in_box(towers, bounding_box)
    # Mirror points
    points_center = towers[i, :]
    points_left = np.copy(points_center)
    points_left[:, 0] = bounding_box[0] - (points_left[:, 0] - bounding_box[0])
    points_right = np.copy(points_center)
    points_right[:, 0] = bounding_box[1] + (bounding_box[1] - points_right[:, 0])
    points_down = np.copy(points_center)
    points_down[:, 1] = bounding_box[2] - (points_down[:, 1] - bounding_box[2])
    points_up = np.copy(points_center)
    points_up[:, 1] = bounding_box[3] + (bounding_box[3] - points_up[:, 1])
    points = np.append(points_center,
                       np.append(np.append(points_left,
                                           points_right,
                                           axis=0),
                                 np.append(points_down,
                                           points_up,
                                           axis=0),
                                 axis=0),
                       axis=0)

    # Trim mirrored frame to withtin a 'perc' pad
    xr, yr = np.ptp(towers.T[0]) * perc, np.ptp(towers.T[1]) * perc
    xmin, xmax = bounding_box[0] - xr, bounding_box[1] + xr
    ymin, ymax = bounding_box[2] - yr, bounding_box[3] + yr
    msk = (points[:, 0] > xmin) & (points[:, 0] < xmax) &\
        (points[:, 1] > ymin) & (points[:, 1] < ymax)
    points = points[msk]

    return points.T


def KDEplot(xmin, xmax, ymin, ymax, values):
    # Gaussian KDE.
    kernel = stats.gaussian_kde(values, bw_method=.2)
    # Grid density (number of points).
    gd_c = complex(0, 50)
    # Define x,y grid.
    x_grid, y_grid = np.mgrid[xmin:xmax:gd_c, ymin:ymax:gd_c]
    positions = np.vstack([x_grid.ravel(), y_grid.ravel()])
    # Evaluate kernel in grid positions.
    k_pos = kernel(positions)

    ext_range = [xmin, xmax, ymin, ymax]
    kde = np.reshape(k_pos.T, x_grid.shape)

    plt.imshow(np.rot90(kde), cmap=plt.get_cmap('RdYlBu_r'), extent=ext_range)


x_data = np.random.uniform(1., 2000., 1000)
y_data = np.random.uniform(1., 2000., 1000)

xmin, xmax = np.min(x_data), np.max(x_data)
ymin, ymax = np.min(y_data), np.max(y_data)
values = np.vstack([x_data, y_data])

# Plot non-mirrored data
plt.subplot(121)
KDEplot(xmin, xmax, ymin, ymax, values)
plt.scatter(*values, s=3, c='k')
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)

# Plot mirrored data
bounding_box = (xmin, xmax, ymin, ymax)
values = dataMirror(values.T, bounding_box)
plt.subplot(122)
KDEplot(xmin, xmax, ymin, ymax, values)
plt.scatter(*values, s=3, c='k')
plt.xlim(xmin, xmax)
plt.ylim(ymin, ymax)

plt.show()
Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • 1
    *believe the mirrored-data KDE represents the actual density better* - couldn't agree more, almost seems like this should be the default – William Miller May 19 '20 at 21:45
  • This is brilliant, but aren't you making some assumption about what happens outside the boundaries? In your example everything is perfectly fine (the surface is supposed to be uniform), but if you are using KDE you probably don't know that. – non87 Feb 04 '21 at 20:05
  • Yes, naturally I am making the assumption that the data outside of the edges is similar to the data located next to the edges (inside the frame). *Some* assumption needs to be made since we don't have data beyond the edges, and I think this is a rather reasonable one. – Gabriel Feb 05 '21 at 01:16
  • @Gabriel Agreed 100% and (again) this is brilliant. I am pointing out the assumption if anyone is using this technique for their data. – non87 Feb 12 '21 at 23:41
  • 1
    Nice solution. But shouldn't you also mirror "diagonally". One can see that the corners are "darker" than expected because of the missing north-west, north-east, south-west and south-east copies. – dannedanne Jun 08 '23 at 18:39
  • @dannedanne yes, that's a good point. Perhaps a sliding window would fix t hat artifact – Gabriel Jun 09 '23 at 17:48