
Let's consider a Spark DataFrame with 2 columns, each of them being of Vector type. Is there a way that doesn't involve UDFs to compute the dot product between them?

I am using Spark 2.4 (on Databricks, in case there is a solution involving their higher-order functions).


1 Answer


There isn't any reasonable* way of doing such a thing, as Vectors are not native Spark SQL types. Instead they are implemented as UserDefinedTypes, and as such can be processed only indirectly.

If the data is narrow you might consider converting to a matching strongly typed Dataset, but it is unlikely to bring any serious improvement (and may even decrease performance).
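For completeness, the typed-Dataset route could look roughly like this. This is only a sketch: it assumes DenseVectors of equal length and a hypothetical schema with two Vector columns named v1 and v2.

```scala
import org.apache.spark.ml.linalg.Vector
import spark.implicits._

// Hypothetical case class matching the two Vector columns of df.
case class VectorPair(v1: Vector, v2: Vector)

val dotProducts = df.as[VectorPair].map { p =>
  // Element-wise multiply and sum; assumes both vectors have the same length.
  p.v1.toArray.zip(p.v2.toArray).map { case (a, b) => a * b }.sum
}
```

Note that this still deserializes every vector into JVM objects row by row, which is exactly why it is unlikely to beat a plain UDF.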


* One could derive a highly indirect solution, for example by:

  • Adding a unique ID.
  • Dumping each vector to JSON.
  • Reading the JSON back, reserializing to the internal StructType representation.
  • Exploding the vector with posexplode (DenseVector) or zipping indices and values (SparseVector).
  • Self-joining by unique ID and index.
  • Aggregating.

Any such thing would be expensive and completely impractical.
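To illustrate how the last three steps would fit together: under the assumption that the JSON round trip has already yielded plain array&lt;double&gt; columns (here called a1 and a2, hypothetical names) alongside the unique id, the explode / self-join / aggregate part could be sketched as:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical input: withId has columns (id, a1, a2), where a1 and a2
// are array<double> recovered from the JSON round trip described above.
val left  = withId.select($"id", posexplode($"a1").as(Seq("pos", "x")))
val right = withId.select($"id", posexplode($"a2").as(Seq("pos", "y")))

// Match elements at the same position and sum the products per row.
val dot = left.join(right, Seq("id", "pos"))
  .groupBy("id")
  .agg(sum($"x" * $"y").as("dot"))
```

Even in this abbreviated form the shuffle-heavy self-join makes the point: the cost is far out of proportion to a simple per-row dot product.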