I want to unfold a vector column into ordinary columns in a DataFrame. The code below creates the individual columns, but something seems to be wrong with the data types or the 'nullable' flag, because I get an error when I call .show. How can I fix this?
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import udf
spark = SparkSession \
    .builder \
    .config("spark.driver.maxResultSize", "40g") \
    .config("spark.sql.shuffle.partitions", "2001") \
    .getOrCreate()
data = [(0.2, 53.3, 0.2, 53.3),
        (1.1, 43.3, 0.3, 51.3),
        (2.6, 22.4, 0.4, 43.3),
        (3.7, 25.6, 0.2, 23.4)]
df = spark.createDataFrame(data, ['A','B','C','D'])
df.show(3)
df.printSchema()
vecAssembler = VectorAssembler(inputCols=['C','D'], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.printSchema()
new_df.show(3)
# Extract each element of the vector into its own column with a UDF.
split1_udf = udf(lambda value: value[0], DoubleType())
split2_udf = udf(lambda value: value[1], DoubleType())
new_df = new_df.withColumn('c1', split1_udf('features')).withColumn('c2', split2_udf('features'))
new_df.printSchema()
new_df.show(3)  # the error appears here
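A likely cause, assuming 'features' holds pyspark.ml vectors as produced by VectorAssembler: indexing such a vector returns a numpy.float64 rather than a built-in float, and Spark cannot serialize that value into a DoubleType column. The sketch below is one possible workaround, not the only one; it wraps the element in float() before returning it. The helper names element_getter and unfold_vector are made up for illustration.

```python
def element_getter(index):
    """Build a function that returns vector[index] as a built-in Python float."""
    def get(vector):
        # vector[index] on a pyspark.ml vector is a numpy.float64;
        # float() converts it to a type that DoubleType can serialize.
        return float(vector[index])
    return get


def unfold_vector(df, vector_col, names):
    """Add one double column per name, taken from vector_col by position."""
    # Imported here so element_getter can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    for i, name in enumerate(names):
        df = df.withColumn(name, udf(element_getter(i), DoubleType())(vector_col))
    return df


# Usage with the DataFrame from the question:
# new_df = unfold_vector(new_df, 'features', ['c1', 'c2'])
# new_df.show(3)
```

The same change can be made directly in the original lambdas by returning float(value[0]) and float(value[1]). On Spark 3.0 or later, pyspark.ml.functions.vector_to_array may be an alternative that avoids the Python UDF entirely.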