3

I am trying to establish what people would loosely refer to as a homemade KDE - I suppose. I am trying to evaluate a density of a rather huge set of datapoints. In particular, having many data points for a scatter, I want to indicate the density using a color gradient (see link below).

For exemplification, I provide a random pair of (x,y) data below. The real data will be spread on different scales, hence the difference in X and Y grid point spacing.

import numpy as np
from matplotlib import pyplot as plt

def homemadeKDE(x, xgrid, y, ygrid, sigmaX = 1, sigmaY = 1):

    a = np.exp( -((xgrid[:,None]-x)/(2*sigmaX))**2 )
    b = np.exp( -((ygrid[:,None]-y)/(2*sigmaY))**2 ) 

    xweights = np.dot(a, x.T)/np.sum(a)
    yweights = np.dot(b, y.T)/np.sum(b)  

    return xweights, yweights

x = np.random.rand(10000)
x.sort()
y = np.random.rand(10000)

xGrid = np.linspace(0, 500, 501)
yGrid = np.linspace(0, 10, 11)

newX, newY = homemadeKDE(x, xGrid, y, yGrid)

What I am stuck with is, how to project these values back to the original x and y vector so I can use it for plotting a 2D scatter plot (x,y) with a z value for the density colored by a given color map like so:

plt.scatter(x, y, c = z, cmap = "jet")

Plotting and KDE approach is in fact inspired by this great answer

EDIT 1 To smooth out some confusion, the idea is to do a gaussian KDE, which would be on a much coarser grid. SigmaX and sigmaY reflect the bandwidth of the kernel in x and y directions, respectively.

Machavity
  • 30,841
  • 27
  • 92
  • 100
Fourier
  • 2,795
  • 3
  • 25
  • 39
  • The weights are simply a grid-based weighing of the points following this definition: http://mccormickml.com/2014/02/26/kernel-regression/ – Fourier Mar 23 '18 at 15:12
  • What would `z` be in your case? How do you want your `x,y` data to be colored exactly? – Gabriel Mar 23 '18 at 15:14
  • @Gabriel It resembles a coloring vector of the points based on the binning by the continuous gauss functions. Please take a look at the scipy.stats based solution I provided in the link – Fourier Mar 23 '18 at 15:19
  • The `scipy.stats.gaussian_kde()` function returns a single value (the KDE) for each `x,y` pair while your `homemadeKDE()` function returns two. It is not clear to me then what would be the equivalent of the KDE value in your case (i.e: the `z` in the scatter plot) – Gabriel Mar 23 '18 at 15:23
  • That is what I am asking for, a way to a proper back projection. One could simply use `np.dot()` to multiply the weights but then the vector is still smaller than the original one, which is obvious due to my grid points being fewer than the data points. – Fourier Mar 23 '18 at 15:25
  • Where is the assumption coming from that two separate kdes in each dimension would have anything to do with the 2D kde you're aiming at? Could this rather be a maths question after all? – ImportanceOfBeingErnest Mar 23 '18 at 15:32
  • Do you have a specific reason for not using the `scipy.stats.gaussian_kde()` function? – Gabriel Mar 23 '18 at 15:32
  • @ImportanceOfBeingErnest It might as well be. I may do the wrong thing here. May be I have to loop for each pair of points and add up the gaussian sum. I was going for a faster approach which probably is wrong mathematically. – Fourier Mar 23 '18 at 15:35
  • @Gabriel Nothing. And I have used it working perfectly. However, I wanted to write my own code to understand it a bit more clearly. – Fourier Mar 23 '18 at 15:36
  • 1
    Perhaps this will help, take a look at how a multi-variate KDE is generated: http://sebastianraschka.com/Articles/2014_kernel_density_est.html#322-implementing-the-code-to-calculate-the-multivariate-gaussian-densities – Gabriel Mar 23 '18 at 16:00

1 Answers1

0

I was actually- with a little bit of thinking -able to solve the problem on my own. Also thanks to the help and insightful comments.

import numpy as np
from matplotlib import pyplot as plt

def gaussianSum1D(gridpoints, datapoints, sigma=1):

    a = np.exp( -((gridpoints[:,None]-datapoints)/sigma)**2 )

    return a

#some test data
x = np.random.rand(10000)
y = np.random.rand(10000)

#create grids
gridSize = 100
xedges = np.linspace(np.min(x), np.max(x), gridSize)
yedges = np.linspace(np.min(y), np.max(y), gridSize)

#calculate weights for both dimensions seperately
a = gaussianSum1D(xedges, x, sigma=2)
b = gaussianSum1D(yedges, y, sigma=0.1)

Z = np.dot(a, b.T).T

#plot original data
fig, ax = plt.subplots()
ax.scatter(x, y, s = 1)
#overlay data with contours 
ax.contour(xedges, yedges, Z, cmap = "jet")
Fourier
  • 2,795
  • 3
  • 25
  • 39