Spark 2.0.0: How to aggregate DataSet with custom encoded types?

Question

I have some data stored as DataSet[(Long, LineString)] using tuple encoder with a kryo encoder for the LineString

implicit def single[A](implicit c: ClassTag[A]): Encoder[A] = Encoders.kryo[A](c)
implicit def tuple2[A1, A2](implicit
                            e1: Encoder[A1],
                            e2: Encoder[A2]
                           ): Encoder[(A1,A2)] = Encoders.tuple[A1,A2](e1, e2)
implicit val lineStringEncoder = Encoders.kryo[LineString]

val ds = segmentPoints.map(
  sp => {
    val p1 = new Coordinate(sp.lon_ini, sp.lat_ini)
    val p2 = new Coordinate(sp.lon_fin, sp.lat_fin)
    val coords = Array(p1, p2)

    (sp.id, gf.createLineString(coords))
  })
  .toDF("id", "segment")
  .as[(Long, LineString)]
  .cache

ds.show

    +----+--------------------+
    | id |       segment      |
    +----+--------------------+
    | 347|[01 00 63 6F 6D 2...|
    | 347|[01 00 63 6F 6D 2...|
    | 347|[01 00 63 6F 6D 2...|
    | 808|[01 00 63 6F 6D 2...|
    | 808|[01 00 63 6F 6D 2...|
    | 808|[01 00 63 6F 6D 2...|
    +----+--------------------+

I can apply any map operation on the segment column and use the underlying LineStrign methods.

ds.map(_._2.getClass.getName).show(false)

+--------------------------------------+
|value                                 |
+--------------------------------------+
|com.vividsolutions.jts.geom.LineString|
|com.vividsolutions.jts.geom.LineString|
|com.vividsolutions.jts.geom.LineString|

I would like to create some UDAFs to process segments with the same id, I have tried the folling two different approaches without any success:

1) Using Aggregator:

val length = new Aggregator[LineString, Double, Double] with Serializable {
  def zero: Double = 0                     // The initial value.
  def reduce(b: Double, a: LineString) = b + a.getLength    // Add an element to the running total
  def merge(b1: Double, b2: Double) = b1 + b2 // Merge intermediate values.
  def finish(b: Double) = b
  // Following lines are missing on the API doc example but necessary to get
  // the code compile
  override def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}.toColumn

ds.groupBy("id")
  .agg(length(col("segment")).as("kms"))
  .show(false)

Here I get the following error:

 Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [id#603L], [id#603L, anon$1(com.test.App$$anon$1@5bf1e07, None, input[0, double, true] AS value#715, cast(value#715 as double), input[0, double, true] AS value#714, DoubleType, DoubleType)['segment] AS kms#721];

2) Using UserDefinedAggregateFunction

class Length extends UserDefinedAggregateFunction {
  val e = Encoders.kryo[LineString]

  // This is the input fields for your aggregate function.
  override def inputSchema: StructType = StructType(
    StructField("segment", DataTypes.BinaryType) :: Nil
  )

  // This is the internal fields you keep for computing your aggregate.
  override def bufferSchema: StructType = StructType(
      StructField("length", DoubleType) :: Nil
  )

  // This is the output type of your aggregatation function.
  override def dataType: DataType = DoubleType

  override def deterministic: Boolean = true

  // This is the initial value for your buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
  }

  // This is how to update your buffer schema given an input.
  override def update(buffer : MutableAggregationBuffer, input : Row) : Unit = {
    // val l0 = input.getAs[LineString](0) // Can't cast to LineString (I guess because it is searialized using given encoder)
    val b = input.getAs[Array[Byte]](0) // This works fine
    val lse = e.asInstanceOf[ExpressionEncoder[LineString]]
    val ls = lse.fromRow(???) // it expects InternalRow  but input is a Row instance
    // I also tried casting b.asInstance[InternalRow] without success.
    buffer(0) = buffer.getAs[Double](0) + ls.getLength
  }

  // This is how to merge two objects with the bufferSchema type.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getAs[Double](0) + buffer2.getAs[Double](0)
  }

  // This is where you output the final value, given the final value of your bufferSchema.
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(0)
  }
}

val length = new Length
rseg
  .groupBy("id")
  .agg(length(col("segment")).as("kms"))
  .show(false)

What am I doing wrong? I would like to use aggregation API with custom types instead of using rdd groupBy API. I searched through the Spark doc but couldn't find the answer to this problem, it seems it is on an early stage at the moment.

Thanks.

score 0 · Answer 1 · edited May 23 '17 at 12:07

0

As per this answer, there is no easy way of passing in custom encoders for nested types, i.e. like (Long,LineString) in your case.

One option could be to define a case class LineStringWithID which would extend LineString with id: Long attribute, and use encoders from SQLImplicits

P.S. Can you break down your questions into smaller parts, one topic each?

edited May 23 '17 at 12:07

Community

1
1

answered Jan 13 '17 at 16:31

Alexey Svyatkovskiy

646
5
16

Przemek Jacewicz · Answer 2 · 2020-03-09T13:55:43.600

Maybe someone will also be looking for this: when kryo encoder is used you cannot use untyped, SQL-based API for dataset manipulation. You can only use typed API and in terms of grouping this means you need to use a custom Aggregator, not a custom UserDefinedAggregateFunction. I think your Aggregator implementation is ok, but your grouping should be changed to use typed groupByKey with your custom aggregator instance e.g.

ds.groupByKey(_._1)
  .agg(length)
  .show(false)

Spark 2.0.0: How to aggregate DataSet with custom encoded types?

2 Answers2