
I have two columns: one of type Integer and one of type linalg.Vector. I can convert the linalg.Vector to an array; each array has 32 elements. I want to turn each element of the array into its own column. The input looks like:

column1                 column2
(3, 5, 25, ..., 12)           3
(2, 7, 15, ..., 10)           4
(1, 10, 12, ..., 35)          2

Output should be:

column1_1  column1_2  column1_3  ...  column1_32  column2
        3          5         25  ...          12        3
        2          7         15  ...          10        4
        1         10         12  ...          35        2

Except, in my case there are 32 elements in each array, which is too many to write out by hand using the method from Convert Array of String column to multiple columns in spark scala.

I tried a few ways and none of them worked. What is the right way to do this?

Thanks a lot.
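For reference, the Vector-to-array step mentioned above can be done with a small UDF (a sketch, assuming the ml, not mllib, Vector type; `df` and `dfArr` are illustrative names for the dataframe shown above and its converted form):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Unpack the ml Vector into a plain array<double> column.
val vecToArray = udf((v: Vector) => v.toArray)
val dfArr = df.withColumn("column1", vecToArray(col("column1")))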

Gofrette
  • https://stackoverflow.com/questions/38110038/spark-scala-how-to-convert-dataframevector-to-dataframef1double-fn-d answers the question. I could not find it before. Thank you. – Gofrette Sep 11 '18 at 14:14

2 Answers

scala> import org.apache.spark.sql.Column
scala> import org.apache.spark.sql.functions.col
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]

scala> def getColAtIndex(id: Int): Column = col("column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column

scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") // for an n-element array use (0 to n-1), i.e. (0 to 31) for the asker's 32 elements
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)

scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
|        3|        5|       25|      3|
|        2|        7|       15|      4|
|        1|       10|       12|      2|
+---------+---------+---------+-------+
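For the asker's 32-element arrays only the range changes; a sketch, where `df32` is a stand-in for the real dataframe with 32-element arrays:

scala> val columns32: IndexedSeq[Column] = (0 to 31).map(getColAtIndex) :+ col("column2")

scala> df32.select(columns32: _*).show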
sujit
  • You might want to add an explanation about how you're generating the expression based on the range. – philantrovert Sep 11 '18 at 13:44
  • Hi, the array contains 32 elements, not 3. I have edited to make the question clearer. Sorry for the confusion. Also, this is a duplicate of https://stackoverflow.com/questions/38110038/spark-scala-how-to-convert-dataframevector-to-dataframef1double-fn-d as flagged by @user6910411 – Gofrette Sep 11 '18 at 14:28
  • @Sujit, your solution is working well. But I could not figure out how `column1` resolves without any reference to the dataframe (`df`) when the `columns` seq is created. How does `getColAtIndex` pick up `column1` from `df`? – Praveen L Sep 12 '18 at 06:48
  • 1
    @Praveen-l, Spark allows us to use `Column` independent of dataframe. If its used that way, then consider it just having metadata on a column's type and name. And, spark lazily applies it to the dataframe as per the context its used in. – sujit Sep 13 '18 at 17:26
  • @Sujit, thanks for your clarification. – Praveen L Sep 14 '18 at 06:57
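A minimal sketch of sujit's point (`dfA` and `dfB` are hypothetical dataframes that each contain a `column1` array column):

import org.apache.spark.sql.functions.col

val c = col("column1")(0).as("column1_1")  // no dataframe referenced yet
dfA.select(c)  // c is resolved against dfA's schema only here
dfB.select(c)  // the very same Column resolves against dfB here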

This is best done by writing a UserDefinedFunction, like:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{lit, udf}

// Extract the element at position idx from the vector.
def getElementFromVector(vec: Vector, idx: Int): Double = {
  vec(idx)
}
val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))

You can use it like this then:

df.select(
    getElementFromVectorUDF($"column1", lit(0)) as "column1_0",
    ...
    getElementFromVectorUDF($"column1", lit(n)) as "column1_n"
)
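
Rather than spelling out all 32 projections by hand, the select list can also be generated programmatically (a sketch along the same lines as the other answer; `32` stands in for the actual array length, and the index is wrapped in `lit` so the UDF receives it as a Column):

val cols = (0 until 32).map(i =>
  getElementFromVectorUDF($"column1", lit(i)) as s"column1_$i"
) :+ $"column2"

df.select(cols: _*)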

I hope this helps.

Elmar Macek