I have a PySpark DataFrame that was created by pyspark.ml.clustering.LDA. The topicDistribution column is a vector of n doubles.

+--------------------+
|   topicDistribution|
+--------------------+
|[0.93673575849807...|
|[0.31615978901762...|
|[0.33657712774309...|
|[0.30523697192979...|
+--------------------+

I want to create a separate (non-vector) double column for each element of the vector. Ultimately, I'm trying to "unwrap" the vector so I can write the data to a CSV file.

I've tried multiple approaches.
Approach 1 was to simply index into the column:

for i in range(3): 
    df = df.withColumn("Col-" + str(i), df['topicDistribution'][i])

but this produces the error

AnalysisException: u"Can't extract value from topicDistribution#858;"
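The error in Approach 1 arises because the column holds an ML vector (a user-defined type) rather than a SQL array, so `[i]` has nothing to extract. A minimal sketch of one possible workaround follows, assuming Spark 3.0 or later, where `pyspark.ml.functions.vector_to_array` is available. The helper names `split_vector_column` and `output_names` are hypothetical and not part of PySpark.

```python
def output_names(n, prefix="Col-"):
    """Build the output column names, e.g. Col-0, Col-1, ... (hypothetical helper)."""
    return [prefix + str(i) for i in range(n)]


def split_vector_column(df, vector_col="topicDistribution", n=3, prefix="Col-"):
    """Return df with one extra double column per vector element.

    Assumes Spark 3.0+; the imports are done lazily so that this module
    can be loaded even where Spark is not installed.
    """
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Convert the ML vector into a plain array<double>, which supports [i].
    arr = vector_to_array(col(vector_col))
    new_cols = [arr[i].alias(name) for i, name in enumerate(output_names(n, prefix))]
    return df.select("*", *new_cols)
```

With a DataFrame `df` shaped like the one above, `split_vector_column(df).drop("topicDistribution")` should leave only plain double columns, which can then be written with `df.write.csv(...)`.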

Approach 2 used a UDF, but as you can see, the result isn't a value from the vector but some kind of "pickle" object, and I don't know what to do with that.

getTyp = udf(lambda arr: getType(arr,1), StringType())
for i in range(3): 
    df = df.withColumn("Col-" + str(i), getTyp(df['topicDistribution']))

which returns

+--------------------+--------------------+--------------------+--------------------+
|   topicDistribution|               Col-0|               Col-1|               Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673577353151...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.31615869274437...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.33657583318666...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.30523585516934...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
+--------------------+--------------------+--------------------+--------------------+
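For Approach 2, one plausible cause of the `net.razorvine.pic...` strings (an assumption, since `getType` is not shown) is that the UDF returns something other than a plain Python value matching the declared return type. A minimal sketch of a UDF that returns a single element as a Python `float` and declares `DoubleType` is given below. The names `element_at` and `add_element_columns` are hypothetical.

```python
def element_at(vec, i):
    """Return element i of a vector as a plain Python float.

    float() matters here: indexing an ML vector yields a NumPy scalar,
    which does not map cleanly onto Spark's DoubleType.
    """
    return float(vec[i])


def add_element_columns(df, vector_col="topicDistribution", n=3, prefix="Col-"):
    """Add one double column per vector element using a UDF (imports are lazy)."""
    from pyspark.sql.functions import lit, udf
    from pyspark.sql.types import DoubleType

    element_udf = udf(element_at, DoubleType())
    for i in range(n):
        # lit(i) passes the index to the UDF as a second column argument.
        df = df.withColumn(prefix + str(i), element_udf(df[vector_col], lit(i)))
    return df
```

This version does not depend on `vector_to_array`, so it may suit older Spark releases, at the cost of a Python UDF call per row and per column.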

Approach 3 used VectorSlicer, which came close, but the resulting columns are still vectors.

for i in range(3): 
    slicer = VectorSlicer(inputCol="topicDistribution", outputCol="Col-" + str(i), indices=[i])
    df = slicer.transform(df)

which produces the following. Notice that each new column is still a one-element vector (hence the surrounding []'s).

+--------------------+--------------------+--------------------+--------------------+
|   topicDistribution|               Col-0|               Col-1|               Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673576108710...|[0.9367357610871071]|[0.03151327102122...|[0.03175096789167...|
|[0.31615848955402...|[0.31615848955402...|[0.3289336386324864]|[0.35490787181348...|
|[0.33657818512851...|[0.3365781851285112]|[0.32473902350327...|[0.3386827913682095]|
|[0.30523627602677...|[0.30523627602677...|[0.3426806504112193]|[0.3520830735620017]|
+--------------------+--------------------+--------------------+--------------------+

There has to be a simple solution, but I'm stumped.

user2906838
Scott Gerard