I'm trying to understand DataFrame column types. Of course, a DataFrame is not a materialized object; it's just a set of instructions for Spark, to be turned into executable code later. But I imagined that this list of column types represents the types of the objects that might materialize inside the JVM when an action is executed.
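For example, my mental model is that nothing in the sketch below touches the JVM until the final action (a minimal sketch; SparkSession.builder.getOrCreate() stands in for whatever session the notebook provides):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# Building the DataFrame only records a logical plan; nothing executes yet.
lazy_df = spark.range(4).withColumn("doubled", F.col("id") * 2)
# Only the action actually runs code on the JVM side.
print(lazy_df.collect())

With that in mind, here is my actual setup with the four vector types: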
import pyspark
import pyspark.ml.linalg      # needed explicitly; `import pyspark` alone doesn't pull in these submodules
import pyspark.mllib.linalg
import pyspark.sql.types as T
import pyspark.sql.functions as F

data = [0, 3, 0, 4]
d = {}
d['DenseVector'] = pyspark.ml.linalg.DenseVector(data)
d['old_DenseVector'] = pyspark.mllib.linalg.DenseVector(data)
d['SparseVector'] = pyspark.ml.linalg.SparseVector(4, dict(enumerate(data)))
d['old_SparseVector'] = pyspark.mllib.linalg.SparseVector(4, dict(enumerate(data)))
df = spark.createDataFrame([d])  # `spark` is the SparkSession provided by the notebook
df.printSchema()
The columns for the four vector values look the same in printSchema() (or schema):
root
|-- DenseVector: vector (nullable = true)
|-- SparseVector: vector (nullable = true)
|-- old_DenseVector: vector (nullable = true)
|-- old_SparseVector: vector (nullable = true)
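Presumably that's just because both VectorUDT classes render themselves the same way; at least this quick check prints 'vector' for both (I'm assuming simpleString() is what printSchema uses):

from pyspark.ml.linalg import VectorUDT as NewVectorUDT
from pyspark.mllib.linalg import VectorUDT as OldVectorUDT

print(NewVectorUDT().simpleString())   # 'vector'
print(OldVectorUDT().simpleString())   # 'vector'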
But when I retrieve them row by row, they turn out to be different:
for x in df.first().asDict().items():
    print(x[0], type(x[1]))
old_SparseVector <class 'pyspark.mllib.linalg.SparseVector'>
SparseVector <class 'pyspark.ml.linalg.SparseVector'>
old_DenseVector <class 'pyspark.mllib.linalg.DenseVector'>
DenseVector <class 'pyspark.ml.linalg.DenseVector'>
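Inspecting the schema fields directly also shows two different UDT classes behind the identical-looking 'vector' entries, which only adds to my confusion:

for field in df.schema.fields:
    print(field.name, type(field.dataType))
# e.g. DenseVector     <class 'pyspark.ml.linalg.VectorUDT'>
#      old_DenseVector <class 'pyspark.mllib.linalg.VectorUDT'>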
I'm confused about the meaning of the vector type (equivalent to VectorUDT for the purposes of writing a UDF). How does the DataFrame know which of the four vector types it has in each vector column? Is the data in those vector columns stored in the JVM or in the Python VM? And how come VectorUDT can be stored in the DataFrame, if it's not one of the official types listed here?
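For context, this is the kind of UDF I have in mind; a minimal sketch that declares the new-style VectorUDT as the return type (the doubling function itself is just a made-up example):

import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Hypothetical UDF: double every element, returning a new-style vector.
double_vec = F.udf(lambda v: Vectors.dense([2.0 * x for x in v.toArray()]), VectorUDT())
df.select(double_vec(df['DenseVector']).alias('doubled')).show()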
(I know that two of the four vector types, the ones from mllib.linalg, will eventually be deprecated.)