
I have a Spark DataFrame with latitude and longitude columns, and I'm trying to calculate the distance from each coordinate to a polyline.

The dataframe I'm working with is large (about 10 billion observations) and I load it using spark_read_parquet(), so it starts out as a Spark DataFrame. I'm wondering how to convert it to a spatial Spark DataFrame so that, using geospark, I can use sf methods to calculate the distance from the points to the line.

(The ultimate goal is to calculate the distance between each lat/lon in the Spark DataFrame and one polyline, so if there's another way to go about this, that works too.)

The code below sets up an example with dummy data.

## Setup
library(dplyr)
library(sparklyr)
library(geospark)
library(sf)  # needed for st_linestring(), st_sfc(), st_as_sf()

sc <- spark_connect(master="local")
register_gis(sc)

## Make dummy dataframe data
df <- data.frame(uid = 1:10,
                 latitude = 1:10,
                 longitude = 1:10)

s_df <- sdf_copy_to(sc = sc, x = df, overwrite = TRUE)

## Make dummy polyline data
polyl <- rbind(c(0,3),c(0,4),c(1,5),c(2,5)) %>% st_linestring()
polyl_sf <- data.frame(id = 1, geom = st_sfc(polyl)) %>% st_as_sf(crs = 4326)
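(Since the polyline only lives in my local R session, I assume I'll eventually need it in a form Spark can consume; one option would be to serialize it to WKT with sf::st_as_text(), e.g.:)

```r
library(sf)

# Rebuild the dummy polyline and serialize it to WKT text,
# which can later be embedded in a Spark SQL expression
polyl <- rbind(c(0, 3), c(0, 4), c(1, 5), c(2, 5)) %>% st_linestring()
polyl_wkt <- st_as_text(polyl)
polyl_wkt
#> [1] "LINESTRING (0 3, 0 4, 1 5, 2 5)"
```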

The code below converts the local data frame to a spatial data frame (sf object).

## Convert from dataframe to spatial dataframe
df %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")

But the same approach on the Spark DataFrame returns the error below:

## Attempt to convert from Spark dataframe to Spark spatial dataframe
s_df %>% 
  st_as_sf(coords = c("longitude", "latitude"),
           crs = "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")
Error in UseMethod("st_as_sf") : 
  no applicable method for 'st_as_sf' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

Or should this somehow be done using st_point instead?
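For what it's worth, the direction I've been considering is to skip st_as_sf() entirely and do the work inside Spark SQL. This is an untested sketch: I'm assuming register_gis(sc) exposes st_geomfromwkt() and st_distance() to sparklyr's dplyr-to-SQL translation (names taken from the geospark/GeoSpark SQL function list), and I'm inlining the local WKT string with `!!` so dbplyr doesn't treat it as a column name:

```r
## Untested sketch -- assumes st_geomfromwkt() and st_distance()
## are registered as Spark SQL functions by register_gis(sc)
polyl_wkt <- "LINESTRING (0 3, 0 4, 1 5, 2 5)"

s_df_dist <- s_df %>%
  mutate(
    # build a point geometry from the lon/lat columns, row by row
    point = st_geomfromwkt(paste0("POINT (", longitude, " ", latitude, ")")),
    # inline the polyline as a WKT literal (!! forces local evaluation)
    line  = st_geomfromwkt(!!polyl_wkt),
    dist  = st_distance(point, line)
  )
```

If something like this works, the distance would presumably be planar and in degrees (since the coordinates are lon/lat); getting metres would need a projection or a geodesic function, if one is available.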

Rob Marty
  • "The dataframe I'm working with is large" - how large / how many POINTs? And is it just one POLYLINE you're measuring against? – SymbolixAU Dec 07 '22 at 21:48
  • (1) The dataframe is about 10 billion observations (too large to load into my memory, so would prefer to keep it as a spark dataframe). (2) And yep, just measuring against one POLYLINE – Rob Marty Dec 07 '22 at 21:56

0 Answers