
I'm plotting 550,000,000 latitude/longitude points using datashader. For this to be useful, though, I need to overlay map tiles and polygons using geoviews. The problem is that gv.Points() and the associated projection cause a large slowdown that defeats the interactive nature of holoviews + bokeh plots.

There's a reproducible example below, but in short: I'm trying to make the geoviews implementation (3) fast enough to use interactively.

First, set up some data:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf
import holoviews as hv 
from holoviews.operation.datashader import datashade
import geopandas as gpd
import geoviews as gv

Scaling the size of the data down by a factor of 10 for this example:

uk_bounding_box = (-14.02,2.09,49.67,61.06)
n = int(550000000 / 10)

# Generate some fake data of the same size
df = dd.from_pandas(
    pd.DataFrame.from_dict({
        'longitude': np.random.normal(
            np.mean(uk_bounding_box[0:2]),
            np.diff(uk_bounding_box[0:2]) / 5, n
        ),
        'latitude': np.random.normal(
            np.mean(uk_bounding_box[2:4]),
            np.diff(uk_bounding_box[2:4]) / 5, n
        )
    }), npartitions=8
)

# Persist data in memory so reading won't slow down datashader
df = df.persist()

(1) Just datashader

Just using datashader without holoviews or geoviews is very quick: output is rendered in 4 seconds, including aggregation, so re-renders would be even faster if interactive.

# Set some plotting params
bounds = dict(x_range = uk_bounding_box[0:2],
              y_range = uk_bounding_box[2:4])
plot_width = 400
plot_height = 300 

The timing for the pure datashader version:

%%time
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)
agg = cvs.points(df, 'longitude', 'latitude', ds.count())

CPU times: user 968 ms, sys: 29.9 ms, total: 998 ms
Wall time: 506 ms

tf.shade(agg)

[datashader render]
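As a quick sanity check that re-renders stay fast, re-aggregating a smaller viewport of the persisted data is even cheaper than the initial render; the zoom window below is an arbitrary assumption for illustration, not from the original benchmark:

# Hypothetical zoom window (assumed values) to show re-aggregation cost
zoom = dict(x_range=(-8.0, -2.0), y_range=(52.0, 58.0))
cvs_zoom = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **zoom)
agg_zoom = cvs_zoom.points(df, 'longitude', 'latitude', ds.count())
tf.shade(agg_zoom)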

(2) datashader in holoviews without geoviews projection

# Set some params
sizes = dict(width=plot_width, height=plot_height)
opts = dict(bgcolor="black", **sizes)

hv.extension('bokeh')

hv.util.opts('Image Curve RGB Polygons [width=400 height=300 shared_axes=False] {+axiswise}')

Without any projection, this is comparable to using pure datashader:

%%time
points = hv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])

shader = datashade(points, precompute=True, **sizes).options(**opts)

CPU times: user 3.32 ms, sys: 131 µs, total: 3.45 ms
Wall time: 3.47 ms

shader

[holoviews render]

(3) datashader in holoviews with geoviews tiles, polygons, and projection

Here's the crux of the issue: I want to align the datashader layer with some map tiles and geospatial polygons. This results in a large slowdown that, for the size of data I'm dealing with, makes the interactive visualisation pointless (a 12-minute wait in total for the render).

I'm sure this is due to the overhead of projecting the points. Is there a way to avoid that, or some other workaround such as precomputing the projection?

# Grab an example shape file to work with
ne_path = gpd.datasets.get_path('naturalearth_lowres')
example_shapes_df = gpd.read_file(ne_path)
uk_shape = example_shapes_df[example_shapes_df.name.str.contains('United K')]


# Grab maptiles
map_tiles = gv.tile_sources.ESRI

# In actual workflow I need to add some polygons
polys = gv.Polygons(uk_shape)

This is as above, with the addition of gv.Points() and projection:

%%time 
points = gv.Points(df, ['longitude', 'latitude']).redim.range(
    x=bounds['x_range'], y=bounds['y_range'])

projected = gv.operation.project_points(points)

shader = datashade(projected, precompute=True, **sizes).options(**opts)

CPU times: user 11.8 s, sys: 3.16 s, total: 15 s
Wall time: 12.5 s

shader * map_tiles * polys

[geoviews render]
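To see where the time goes, the projection step can be profiled in isolation. This is a sketch of how the profile quoted in the comments below could be produced; the exact cProfile invocation is my assumption:

import cProfile

# Profile only the projection, leaving the datashading out of the picture
points_to_profile = gv.Points(df, ['longitude', 'latitude'])
cProfile.run('gv.operation.project_points(points_to_profile)', sort='cumtime')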

  • Could you clarify the timings? You say it takes 12 minutes, but your %%time output says 12 seconds, and when running your example locally it takes about 30 seconds to project and then about 1.5 seconds to render and 300ms to update on zoom. – philippjfr Nov 29 '18 at 17:56
  • Hi @philippjfr - in the actual code the data is 10 times the size of the example. `n = int(550000000 / 10)` I've scaled it here so it didn't take ages to run the example, but when I use the full 550,000,000 coordinates the final plot takes 12 mins. – wab Nov 29 '18 at 18:49
  • Here's each time on the actual data, same code as above but for 550,000,000 points: 1) `CPU times: user 8.4 s, sys: 16.1 s, total: 24.4 s Wall time: 4.38 s` 2) `CPU times: user 3.9 ms, sys: 476 µs, total: 4.38 ms Wall time: 4.35 ms` 3) `CPU times: user 3min 24s, sys: 6min 27s, total: 9min 51s Wall time: 10min 31s` – wab Nov 29 '18 at 21:19
  • Looking at the profile it seems projection is what is taking most of the time:

        ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
             1  268.726  268.726  268.726  268.726  {method 'transform_points' of 'cartopy._crs.CRS' objects}
             1  191.124  191.124  625.980  625.980  projection.py:188(_process_element)
             8  127.947   15.993  127.947   15.993  {built-in method numpy.core.multiarray.concatenate}
             1   34.910   34.910   34.910   34.910  projection.py:195()
             1    2.001    2.001  627.981  627.981  dimension.py:712(map)

    – wab Nov 29 '18 at 22:10
  • Thanks, should have read more carefully. We have plans to eventually parallelize the projecting code with dask but for the time being I can only suggest that you project the points once and then save the projected points as a parquet file. – philippjfr Nov 29 '18 at 22:12

1 Answer


As suggested by @philippjfr, the solution is to project the coordinates into the appropriate coordinate system ahead of time, then render using method 2 or 3 above.

Something like this:

import cartopy

def platcaree_to_mercator_vectorised(x, y):
    '''Use cartopy to convert PlateCarree coords to Web Mercator.'''
    return cartopy.crs.GOOGLE_MERCATOR.transform_points(
        cartopy.crs.PlateCarree(), x, y)

def platcaree_for_map_partitions(pddf):
    '''Wrapper to apply the Mercator conversion and convert back to a dataframe for dask.'''
    as_arrays = platcaree_to_mercator_vectorised(pddf.longitude.values, pddf.latitude.values)
    # transform_points returns an (n, 3) array; drop the z column
    as_df = pd.DataFrame.from_records(as_arrays[:, :2], columns=['longitude', 'latitude'])
    return as_df


# Project the points
df_projected = df.map_partitions(platcaree_for_map_partitions,
                                 meta={'longitude': 'f8', 'latitude': 'f8'})
from dask.diagnostics import ProgressBar
with ProgressBar():
    df_projected.to_parquet('abb_projected.parquet', compression='SNAPPY')

Then use this projected dataset with method 2 or 3, as detailed in the question; a sketch follows below.
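A minimal sketch of re-using the saved file with method (2): the parquet name comes from the snippet above, and the rest reuses objects defined in the question; it's an illustrative assumption rather than the original workflow.

# Read back the pre-projected (Web Mercator) coordinates
df_projected = dd.read_parquet('abb_projected.parquet').persist()

# Plain hv.Points suffices now; no gv projection step is needed
points = hv.Points(df_projected, ['longitude', 'latitude'])
shader = datashade(points, precompute=True, **sizes).options(**opts)

# The polygon layer is small, so letting geoviews project it at render
# time is cheap compared with projecting the points themselves
shader * map_tiles * polys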

  • If you've already done the projection, you can use either HoloViews or GeoViews as appropriate; GeoViews won't reproject your data unnecessarily. So you don't have to *avoid* geoviews after projecting, you just don't particularly need it. – James A. Bednar Dec 05 '18 at 15:51