Question

In Apache Spark, how can I sum the total distance travelled along a sequence of GPS coordinates (each representing a point visited) without collecting the data to the driver node?

Each coordinate is represented by a Location object, and the sequence of movements is represented as an RDD[Location] (i.e. L0 -> L1 -> L2 -> ... -> Ln).

case class Location(latitude: Double, longitude: Double)

The simplest DataFrame, containing only two coordinates, is shown below; in practice many more coordinates follow.

+--------+---------+
|latitude|longitude|
+--------+---------+
|    10.0|     20.0|  
|    40.0|     20.0|
+--------+---------+

Problem

I am trying to figure out whether there is a way to walk through L0 -> L1 -> ... -> Ln and sum the total distance moved without calling .collect, which would load all the data into the driver program.

There is no foldLeft equivalent in Spark. fold and reduce require the result to have the same type as the RDD elements (Location), whereas I need an Int. The seqOp operator of the aggregate method does not seem usable either, because I do not see how it could use the previous coordinate to calculate a distance.

Hence I suppose none of these can be used to walk through the RDD[Location] and accumulate the total distance as an Int by repeatedly calculating the distance between L(i-1) and L(i).

If this were a simple word count, reduceByKey would do. However, it cannot produce an Int total from an RDD[Location].

I have been trying to think of a way to calculate this but have not found one yet. Please suggest a solution or an idea.
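One possible direction, offered only as a sketch: pair every element with its successor by indexing the RDD with zipWithIndex and joining it against a copy whose index is shifted by one, then sum the per-pair distances with an ordinary distributed sum. This assumes that the RDD's partition order really is the visiting order (in practice an explicit timestamp or sequence column would be safer), and the names TotalDistance, totalDistance and degreesApart are hypothetical. degreesApart is a placeholder metric used only to keep the example short; the real distance function would be passed in its place. Note that the join introduces a shuffle.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Location(latitude: Double, longitude: Double)

object TotalDistance {
  // Placeholder metric (assumption): straight-line separation in degrees.
  // Substitute the real kilometre-based distance function here.
  def degreesApart(a: Location, b: Location): Double =
    math.hypot(b.latitude - a.latitude, b.longitude - a.longitude)

  // Sums dist(L(i), L(i+1)) over all consecutive pairs without collecting.
  def totalDistance(path: RDD[Location],
                    dist: (Location, Location) => Double): Double = {
    // (i, L(i)): index follows partition order, assumed to be visiting order.
    val indexed = path.zipWithIndex().map { case (loc, i) => (i, loc) }
    // (i - 1, L(i)): under key k this holds the successor L(k + 1).
    val successors = indexed.map { case (i, loc) => (i - 1, loc) }
    // Joining on the key yields (k, (L(k), L(k + 1))) for every consecutive pair.
    indexed.join(successors)
      .values
      .map { case (from, to) => dist(from, to) }
      .sum()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("total-distance-sketch")
      .getOrCreate()
    val path = spark.sparkContext.parallelize(
      Seq(Location(10.0, 20.0), Location(40.0, 20.0)))
    println(totalDistance(path, degreesApart)) // 30.0 with the placeholder metric
    spark.stop()
  }
}
```

With a single element or an empty RDD the join produces no pairs and the sum is 0.0. The result is a Double; it can be rounded or truncated to Int at the end if that is the required type.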

Code

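The code below did not work and, in my understanding, cannot work, because the function passed to foreach is executed remotely on each node, so the total and from variables on the driver are never updated.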

There needs to

private[weather] def distance(): Int = {
  val coordinates = observationDF.select("latitude", "longitude")
      .rdd
      .map(row => Location(row(0).toString.toDouble, row(1).toString.toDouble))
      //.collect()

  var total:  Int = 0
  var from: Location = coordinates.first()
  //var from: Location = coordinates.head

  val getTotalDistance: (Location) => Unit = (to) => {
    total += calculateDistanceInKilometer(from, to)
    from = to

    println(s"location is $to)")
    println(s"total is $total")  // non zero
  }

  coordinates.foreach(getTotalDistance(_))
  println("Final total is " + " " + total.toString) // zero
  total
}

Output:

location is Location(40.0,20.0))
total is 3335.toString

location is Location(10.0,20.0))
total is 0.toString

Final total is  0
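For context on the numbers in this output: calculateDistanceInKilometer is not shown in the post, so the following is only a hypothetical stand-in. It assumes a great-circle (haversine) distance with a mean Earth radius of 6371 km, truncated to Int. Under those assumptions the pair (10.0, 20.0) -> (40.0, 20.0) gives 3335, and a point paired with itself gives 0, which is consistent with the per-record totals printed on the executors.

```scala
case class Location(latitude: Double, longitude: Double)

object HaversineCheck {
  // Assumption: mean Earth radius in kilometres.
  val EarthRadiusKm = 6371.0

  // Hypothetical stand-in for the calculateDistanceInKilometer used in the post.
  def calculateDistanceInKilometer(from: Location, to: Location): Int = {
    val dLat = math.toRadians(to.latitude - from.latitude)
    val dLon = math.toRadians(to.longitude - from.longitude)
    // Haversine term; clamped to 1.0 to guard against rounding error.
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(from.latitude)) *
        math.cos(math.toRadians(to.latitude)) *
        math.pow(math.sin(dLon / 2), 2)
    val km = 2 * EarthRadiusKm * math.asin(math.sqrt(math.min(1.0, h)))
    km.toInt // truncates toward zero (assumption, the post only shows an Int)
  }

  def main(args: Array[String]): Unit = {
    val l0 = Location(10.0, 20.0)
    val l1 = Location(40.0, 20.0)
    println(calculateDistanceInKilometer(l0, l1)) // 3335
    println(calculateDistanceInKilometer(l0, l0)) // 0
  }
}
```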

Research

There are several articles on calculating distances for K-means, but so far I have not found one that deals with a sequence of coordinates.
