
Let's consider a Spark DataFrame with 2 columns, each of them being of Vector type. Is there a way that doesn't involve UDFs to compute the dot product between them?

I am using Spark 2.4 (on Databricks, in case there is a solution involving their higher-order functions).


1 Answer


There isn't any reasonable* way of doing such a thing, as Vectors are not native Spark SQL types. Instead they are implemented as UserDefinedTypes, and as such can be processed only indirectly.

If the data is narrow you might consider converting to a matching strongly typed Dataset, but it is unlikely to bring any serious improvement (and may even decrease performance).
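For completeness, the typed-Dataset route could look roughly like this. This is only a sketch: it assumes DenseVectors of equal length and a hypothetical schema with two Vector columns named v1 and v2.

```scala
import org.apache.spark.ml.linalg.Vector
import spark.implicits._

// Hypothetical case class matching the two Vector columns of df.
case class VectorPair(v1: Vector, v2: Vector)

val dotProducts = df.as[VectorPair].map { p =>
  // Element-wise multiply and sum; assumes both vectors have the same length.
  p.v1.toArray.zip(p.v2.toArray).map { case (a, b) => a * b }.sum
}
```

Note that this still deserializes every vector into JVM objects row by row, which is exactly why it is unlikely to beat a plain UDF.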


* One could derive a highly indirect solution, for example by:

  • Adding a unique ID.
  • Dumping each vector to JSON.
  • Reading the JSON back, reserializing to the internal StructType representation.
  • Exploding the vector with posexplode (DenseVector) or zipping indices and values (SparseVector).
  • Self-joining by unique ID and index.
  • Aggregating.

Any such thing would be expensive and completely impractical.
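To illustrate how the last three steps would fit together: under the assumption that the JSON round trip has already yielded plain array&lt;double&gt; columns (here called a1 and a2, hypothetical names) alongside the unique id, the explode / self-join / aggregate part could be sketched as:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical input: withId has columns (id, a1, a2), where a1 and a2
// are array<double> recovered from the JSON round trip described above.
val left  = withId.select($"id", posexplode($"a1").as(Seq("pos", "x")))
val right = withId.select($"id", posexplode($"a2").as(Seq("pos", "y")))

// Match elements at the same position and sum the products per row.
val dot = left.join(right, Seq("id", "pos"))
  .groupBy("id")
  .agg(sum($"x" * $"y").as("dot"))
```

Even in this abbreviated form the shuffle-heavy self-join makes the point: the cost is far out of proportion to a simple per-row dot product.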