Kernel Density estimation in projected coordinate system

Question

I am using the Kernel Density Estimation from the Python module sklearn. My data is in a Geopandas GeoDataframe. Currently, I am doing this in geographic coordinate (EPSG:4326). However, I want to do this with projected coordinates in UTM (EPSG:25833). The KDE works when I leave my data in 4326, however, when I reproject the GeoDataframe to 25833, the KDE gives an empty output.

Example taken from here: https://pygis.io/docs/e_summarize_vector.html#method-2-display-and-export-with-scikit-learn

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

# County boundaries
# Source: https://opendata.mtc.ca.gov/datasets/san-francisco-bay-region-counties-clipped?geometry=-125.590%2C37.123%2C-119.152%2C38.640
counties = gpd.read_file("../_static/e_vector_shapefiles/sf_bay_counties/sf_bay_counties.shp")


# Well locations
# Source: https://gis.data.ca.gov/datasets/3a3e681b894644a9a95f9815aeeeb57f_0?geometry=-123.143%2C36.405%2C-119.230%2C37.175
# Modified by author so that only the well locations within the counties and the surrounding 50 km were kept
wells = gpd.read_file("../_static/e_vector_shapefiles/sf_bay_wells_50km/sf_bay_wells_50km.shp")

# Set projection to WGS 84 and reproject data
proj_wgs = 4326
counties_wgs = counties.to_crs(proj_wgs)
wells_wgs = wells.to_crs(proj_wgs)

# Get X and Y coordinates of well points
x_sk = wells_wgs["geometry"].x
y_sk = wells_wgs["geometry"].y

# Get minimum and maximum coordinate values of well points
min_x_sk, min_y_sk, max_x_sk, max_y_sk = wells_wgs.total_bounds

# Create a cell mesh grid
# Horizontal and vertical cell counts should be the same
XX_sk, YY_sk = np.mgrid[min_x_sk:max_x_sk:100j, min_y_sk:max_y_sk:100j]

# Create 2-D array of the coordinates (paired) of each cell in the mesh grid
positions_sk = np.vstack([XX_sk.ravel(), YY_sk.ravel()]).T

# Create 2-D array of the coordinate values of the well points
Xtrain_sk = np.vstack([x_sk, y_sk]).T

# Get kernel density estimator (can change parameters as desired)
kde_sk = KernelDensity(bandwidth = 0.04, metric = 'euclidean', kernel = 'gaussian', algorithm = 'auto')

# Fit kernel density estimator to wells coordinates
kde_sk.fit(Xtrain_sk)

# Evaluate the estimator on coordinate pairs
Z_sk = np.exp(kde_sk.score_samples(positions_sk))

# Reshape the data to fit mesh grid
Z_sk = Z_sk.reshape(XX_sk.shape)

fig, ax = plt.subplots(1, 1, figsize = (10, 10))
ax.imshow(np.rot90(Z_sk), cmap = "RdPu", extent = [min_x_sk, max_x_sk, min_y_sk, max_y_sk])
ax.plot(x_sk, y_sk, 'k.', markersize = 2, alpha = 0.1)
counties_wgs.plot(ax = ax, color = 'none', edgecolor = 'dimgray')
ax.set_title('San Francisco Bay Area - SciKit-Learn Kernel Density Estimation for Wells', fontdict = {'fontsize': '15', 'fontweight' : '3'})
plt.show()

This works. However, when I set proj_wgs = 25833 the result is empty.

How can I do the KDE from the sklearn module in projected coordinates?

Instead of converting CRS systems with `gdf.to_crs(25833)`, use the following: `gdf.to_crs('epsg:25833')`. Does that help at all? — Felipe D., Mar 18 '22 at 15:28
Btw, why are you using 'epsg:25833'? According to [this site](https://epsg.io/25833), that's a coordinate system that's supposed to be used for data in Europe. That's likely why the data is coming out blank when you try to transform it. Try instead using 'epsg:32610'. That's specifically for the San Francisco Bay area. — Felipe D., Mar 21 '22 at 14:56

score 1 · Accepted Answer · answered Apr 07 '22 at 08:39

I cross-posted this on skitlearn GitHub Page, beause I assumed this is too specific for stackoverflow. I got the following response from cmarm:

When you convert from geographic coordinates to projected coordinates the scale of the coordinate variation also changes. In your example, geographic coordinates cover around a tenth of degrees, projected ones cover hundreds of thousands of meters. The bandwidth of the kernel estimation is particularly affected by the coordinate variation, it is related to the resolution of the estimation of your density (see for example the 1D histogram example in the documentation). To obtain similar results in the two cases the bandwidth should be increased, taking into account the relation between the linear and projected coordinates. On Earth this means that your bandwidth should be increased by a factor of ~10^5.

# Get kernel density estimator (can change parameters as desired)
kde_sk = KernelDensity(bandwidth = 4000, metric = 'euclidean', kernel = 'gaussian', algorithm = 'auto')

That said, the better way to avoid this kind of dependencies is to use the correct metric for spherical problems: the Haversine metric is available in scikit-learn. This is how a similar problem is solved in the related example.

All credits to cmarm

Kernel Density estimation in projected coordinate system

1 Answers1