I'm trying to understand DataFrame column types. Of course, a DataFrame is not a materialized object; it's just a set of instructions for Spark, to be turned into executable code later. But I imagined that this list of column types represents the types of the objects that might materialize inside the JVM when an action is executed.
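For example, my mental model is that nothing in the sketch below touches the JVM until the final action (a minimal sketch; SparkSession.builder.getOrCreate() stands in for whatever session the notebook provides):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
# Building the DataFrame only records a logical plan; nothing executes yet.
lazy_df = spark.range(4).withColumn("doubled", F.col("id") * 2)
# Only the action actually runs code on the JVM side.
print(lazy_df.collect())

With that in mind, here is my actual setup with the four vector types: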
import pyspark
import pyspark.ml.linalg      # needed explicitly; `import pyspark` alone doesn't pull in these submodules
import pyspark.mllib.linalg
import pyspark.sql.types as T
import pyspark.sql.functions as F

data = [0, 3, 0, 4]
d = {}
d['DenseVector'] = pyspark.ml.linalg.DenseVector(data)
d['old_DenseVector'] = pyspark.mllib.linalg.DenseVector(data)
d['SparseVector'] = pyspark.ml.linalg.SparseVector(4, dict(enumerate(data)))
d['old_SparseVector'] = pyspark.mllib.linalg.SparseVector(4, dict(enumerate(data)))
df = spark.createDataFrame([d])  # `spark` is the SparkSession provided by the notebook
df.printSchema()
The columns for the four vector values look the same in printSchema() (or schema):
root
|-- DenseVector: vector (nullable = true)
|-- SparseVector: vector (nullable = true)
|-- old_DenseVector: vector (nullable = true)
|-- old_SparseVector: vector (nullable = true)
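Presumably that's just because both VectorUDT classes render themselves the same way; at least this quick check prints 'vector' for both (I'm assuming simpleString() is what printSchema uses):

from pyspark.ml.linalg import VectorUDT as NewVectorUDT
from pyspark.mllib.linalg import VectorUDT as OldVectorUDT

print(NewVectorUDT().simpleString())   # 'vector'
print(OldVectorUDT().simpleString())   # 'vector'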
But when I retrieve them row by row, they turn out to be different:
for x in df.first().asDict().items():
    print(x[0], type(x[1]))
old_SparseVector <class 'pyspark.mllib.linalg.SparseVector'>
SparseVector <class 'pyspark.ml.linalg.SparseVector'>
old_DenseVector <class 'pyspark.mllib.linalg.DenseVector'>
DenseVector <class 'pyspark.ml.linalg.DenseVector'>
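Inspecting the schema fields directly also shows two different UDT classes behind the identical-looking 'vector' entries, which only adds to my confusion:

for field in df.schema.fields:
    print(field.name, type(field.dataType))
# e.g. DenseVector     <class 'pyspark.ml.linalg.VectorUDT'>
#      old_DenseVector <class 'pyspark.mllib.linalg.VectorUDT'>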
I'm confused about the meaning of the vector type (equivalent to VectorUDT for the purposes of writing a UDF). How does the DataFrame know which of the four vector types it has in each vector column? Is the data in those vector columns stored in the JVM or in the Python VM? And how come VectorUDT can be stored in the DataFrame, if it's not one of the official types listed here?
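For context, this is the kind of UDF I have in mind; a minimal sketch that declares the new-style VectorUDT as the return type (the doubling function itself is just a made-up example):

import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Hypothetical UDF: double every element, returning a new-style vector.
double_vec = F.udf(lambda v: Vectors.dense([2.0 * x for x in v.toArray()]), VectorUDT())
df.select(double_vec(df['DenseVector']).alias('doubled')).show()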
(I know that two of the four vector types, the ones from mllib.linalg, will eventually be deprecated.)