Question
In Apache Spark, how can I sum the total distance travelled from a sequence of GPS coordinates (each representing a point visited) without collecting the data to the driver node?
Each coordinate is represented by a Location object, and the sequence of movements is represented as an RDD[Location] (i.e. L0 -> L1 -> L2 -> ... -> Ln).
case class Location(latitude: Double, longitude: Double)
The simplest DataFrame, with only two coordinates in sequence, is shown below; the real data has many more coordinates following these.
+--------+---------+
|latitude|longitude|
+--------+---------+
|    10.0|     20.0|
|    40.0|     20.0|
+--------+---------+
Problem
I am trying to find a way to go through L0 -> L1 -> ... -> Ln and sum the total distance moved without calling .collect, which would load all the data into the driver program.
There is no foldLeft equivalent in Spark. The fold and reduce methods in Spark require the result to have the same type as the RDD's elements (here Location, whereas the total is an Int). The seqOp of the aggregate method does not seem usable for calculating a distance from the previous coordinate.
Hence I suppose these cannot be used to go through the RDD[Location] and accumulate a total distance of type Int by repeatedly calculating L(i) - L(i-1).
For simple word counting, reduceByKey would do. However, it cannot produce an Int total from an RDD[Location(latitude: Double, longitude: Double)].
I have been trying to think of a way to calculate this but have not found one yet. Please suggest a solution or idea.
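One possible way to pair each L(i-1) with L(i) inside Spark, offered as a sketch rather than a definitive solution: index the points with zipWithIndex, re-key each point by its predecessor's index, join, and sum the leg distances. It assumes the RDD's order is the travel order (a DataFrame without an explicit sort does not guarantee this), and the join causes a shuffle. The names TotalDistanceSketch, totalDistance and the dist parameter are hypothetical.

```scala
import org.apache.spark.rdd.RDD

case class Location(latitude: Double, longitude: Double)

object TotalDistanceSketch {
  // dist is whatever leg-distance function is in use (hypothetical parameter).
  def totalDistance(points: RDD[Location])(dist: (Location, Location) => Double): Double = {
    // (i, L(i)): assumes the RDD order is the travel order.
    val byIndex = points.zipWithIndex().map { case (loc, i) => (i, loc) }
    // (i - 1, L(i)): each point re-keyed by the index of its predecessor.
    val byPredecessor = byIndex.map { case (i, loc) => (i - 1, loc) }
    // Joining on the key yields (L(i), L(i + 1)) for every consecutive pair.
    byIndex.join(byPredecessor)
      .values
      .map { case (from, to) => dist(from, to) }
      .sum() // distributed sum; only the final Double reaches the driver
  }
}
```

With a helper that returns Int, such as the one used in the code below, the call might look like totalDistance(coordinates)((a, b) => calculateDistanceInKilometer(a, b).toDouble). An RDD with fewer than two points produces no pairs, so the result is 0.0.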
Code
The code below did not work, and in my understanding cannot work, because the closure is executed remotely on each node.
There needs to
private[weather] def distance(): Int = {
  val coordinates = observationDF.select("latitude", "longitude")
    .rdd
    .map(row => Location(row(0).toString.toDouble, row(1).toString.toDouble))
    //.collect()
  var total: Int = 0
  var from: Location = coordinates.first()
  //var from: Location = coordinates.head
  val getTotalDistance: (Location) => Unit = (to) => {
    total += calculateDistanceInKilometer(from, to)
    from = to
    println(s"location is $to)")
    println(s"total is $total") // non zero
  }
  coordinates.foreach(getTotalDistance(_))
  println("Final total is " + " " + total.toString) // zero
  total
}
Output:
location is Location(40.0,20.0))
total is 3335.toString
location is Location(10.0,20.0))
total is 0.toString
Final total is 0
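The helper calculateDistanceInKilometer is not shown above. As a sketch only, assuming it computes a haversine great-circle distance with a mean Earth radius of 6371 km and truncates to Int (both are assumptions, not taken from the original code), the following stand-in reproduces the 3335 printed for the leg between Location(10.0,20.0) and Location(40.0,20.0). The object name DistanceSketch is hypothetical.

```scala
case class Location(latitude: Double, longitude: Double)

object DistanceSketch {
  // Assumption: mean Earth radius in kilometres.
  val EarthRadiusKm = 6371.0

  // Hypothetical stand-in for calculateDistanceInKilometer:
  // haversine great-circle distance, truncated to Int.
  def calculateDistanceInKilometer(from: Location, to: Location): Int = {
    val dLat = math.toRadians(to.latitude - from.latitude)
    val dLon = math.toRadians(to.longitude - from.longitude)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(from.latitude)) * math.cos(math.toRadians(to.latitude)) *
        math.pow(math.sin(dLon / 2), 2)
    // Central angle between the two points, in radians.
    val centralAngle = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    (EarthRadiusKm * centralAngle).toInt
  }
}
```

For two points on the same meridian 30 degrees of latitude apart, the central angle is 30 degrees (about 0.5236 rad), and 6371 km times that angle is roughly 3335.8 km, which truncates to 3335.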
Research
There are several articles on calculating distances for K-means, but so far I have not found one that deals with a sequence of coordinates.