I have a Spark dataframe containing vector data in one of its columns like the first column shown below

+--------------------+-----+-----------+
|            features|Label|OutputLabel|
+--------------------+-----+-----------+
|(1133,[33,296,107...|    0|        0.0|
|(1133,[19,1045,10...|    0|        0.0|
|(1133,[9,398,1075...|    0|        0.0|
|(1133,[0,927,1074...|    0|        0.0|
|(1133,[41,223,107...|    0|        0.0|
|(1133,[70,285,108...|    0|        0.0|
|(1133,[4,212,1074...|    0|        0.0|
|(1133,[25,261,107...|    0|        0.0|
|(1133,[0,258,1074...|    0|        0.0|
|(1133,[2,219,1074...|    0|        0.0|
|(1133,[8,720,1074...|    0|        0.0|
|(1133,[2,260,1074...|    0|        0.0|
|(1133,[54,348,107...|    0|        0.0|
|(1133,[167,859,10...|    0|        0.0|
|(1133,[1,291,1074...|    0|        0.0|
|(1133,[1,211,1074...|    0|        0.0|
|(1133,[23,216,107...|    0|        0.0|
|(1133,[126,209,11...|    0|        0.0|
|(1133,[70,285,108...|    0|        0.0|
|(1133,[96,417,107...|    0|        0.0|
+--------------------+-----+-----------+

Please see below the schema of this dataframe

root
 |-- features: vector (nullable = true)
 |-- Label: integer (nullable = true)
 |-- OutputLabel: double (nullable = true)
  • Question 1 : I need to split the first column into two columns, so that the integer data goes into one column and the array data into another. I am not sure how to do this in Spark / Scala. Any pointers on this would be helpful.

    When I tried to write this dataframe as csv file, I got the below error

    Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.

  • Question 2 : I understand that this dataframe also cannot be written as a text file, since text output writes only a single column, and that column must not be of struct type. So is it possible to write this dataframe after splitting the first column into two separate columns? The second column would then be of array type. Can we write it to an output file that way?

  • Question 3 : Is there any way to write the array data alone into a CSV file?

zero323
JKC

1 Answer

So is it possible to write this dataframe after splitting the first column into two separate columns ?

No. What you see is just the string representation of a SparseVector. Even if you extract the indices and values into their own columns, the CSV source supports only atomic types, not arrays.
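To make that concrete, here is a minimal sketch of the extraction step (assuming a dataframe `df` with a `features` column of Spark ML's vector type, and a `spark` session in scope). The `toIndices` / `toValues` UDFs and the `split` name are illustrative, not part of the original answer; the point is that the resulting columns are `array<int>` and `array<double>`, which CSV still rejects:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Pull the sparse indices and values out of the vector into array columns.
// These UDF names are hypothetical helpers for illustration.
val toIndices = udf((v: Vector) => v.toSparse.indices)
val toValues  = udf((v: Vector) => v.toSparse.values)

val split = df
  .withColumn("indices", toIndices($"features"))
  .withColumn("values",  toValues($"features"))

// split.write.csv(...) would still fail: CSV cannot store array columns.
```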

If you're dead set on using CSV, I'd convert the whole column to JSON:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $ column syntax

val df = sc.parallelize(Seq(
  // values must be Array[Double], not Array[Int]
  (Vectors.sparse(100, Array(1, 11, 42), Array(1.0, 2.0, 3.0)), 0, 0.0)
)).toDF("features", "label", "outputlabel")

df.withColumn("features", to_json(struct($"features"))).write.csv(...)

To parse it back to a Vector, follow the instructions provided here.

Alper t. Turker