I have a PySpark DataFrame that was created by pyspark.ml.clustering.LDA. The topicDistribution column is a vector of n doubles.
+--------------------+
| topicDistribution|
+--------------------+
|[0.93673575849807...|
|[0.31615978901762...|
|[0.33657712774309...|
|[0.30523697192979...|
+--------------------+
I want to create a separate (non-vector) double column for each element of the vector. Ultimately, I'm trying to "unwrap" the vector so I can write the data to a CSV file.
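To make the goal concrete, here is a minimal plain-Python sketch (no Spark involved) of the shape I'm after. The row values are made-up stand-ins rather than my real LDA output, the helper names unwrap and to_csv are hypothetical, and the Col-i names simply follow the naming used in my attempts.

```python
import csv
import io

# Made-up stand-ins for topicDistribution rows (not the real LDA output).
rows = [
    [0.9, 0.05, 0.05],
    [0.3, 0.3, 0.4],
]

def unwrap(vectors):
    """Return a header plus one flat row of plain floats per vector."""
    width = len(vectors[0])
    header = ["Col-" + str(i) for i in range(width)]
    flat = [[float(x) for x in vec] for vec in vectors]
    return header, flat

def to_csv(vectors):
    """Render the unwrapped vectors as CSV text, one scalar per column."""
    header, flat = unwrap(vectors)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(flat)
    return buf.getvalue()

print(to_csv(rows))
```

In other words, every element of the vector should end up as its own scalar column, with no brackets left in the CSV output.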
I've tried multiple approaches.
Approach 1 was to simply index into the column:
for i in range(3):
    df = df.withColumn("Col-" + str(i), df['topicDistribution'][i])
but this produces the error
AnalysisException: u"Can't extract value from topicDistribution#858;"
Approach 2 used a UDF, but as you can see, my UDF isn't passed a vector but a "pickle", and I don't know what to do with that.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
getTyp = udf(lambda arr: getType(arr,1), StringType())
for i in range(3):
    df = df.withColumn("Col-" + str(i), getTyp(df['topicDistribution']))
which returns
+--------------------+--------------------+--------------------+--------------------+
| topicDistribution| Col-0| Col-1| Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673577353151...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.31615869274437...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.33657583318666...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.30523585516934...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
+--------------------+--------------------+--------------------+--------------------+
Approach 3 used VectorSlicer, which came close, but the resulting columns are still vectors.
from pyspark.ml.feature import VectorSlicer
for i in range(3):
    slicer = VectorSlicer(inputCol="topicDistribution", outputCol="Col-" + str(i), indices=[i])
    df = slicer.transform(df)
which produces the following. Notice that each column is still a vector (see the surrounding square brackets):
+--------------------+--------------------+--------------------+--------------------+
| topicDistribution| Col-0| Col-1| Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673576108710...|[0.9367357610871071]|[0.03151327102122...|[0.03175096789167...|
|[0.31615848955402...|[0.31615848955402...|[0.3289336386324864]|[0.35490787181348...|
|[0.33657818512851...|[0.3365781851285112]|[0.32473902350327...|[0.3386827913682095]|
|[0.30523627602677...|[0.30523627602677...|[0.3426806504112193]|[0.3520830735620017]|
+--------------------+--------------------+--------------------+--------------------+
There has to be a simple solution, but I'm stumped.