
I want to know how I can "merge" multiple dataframe columns into one string array column.

For example, I have this dataframe:

val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")

Which looks like this:

scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|
+---+----+------+-------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)

How can I transform it so that it looks like this:

scala> df.show
+---+-----------------+
| Id|             List|
+---+-----------------+
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|
+---+-----------------+

scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- List: array (nullable = true)
 |    |-- element: string (containsNull = true)
V. Samma

2 Answers


Use org.apache.spark.sql.functions.array:

import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")

result.show()
// +---+------------------+
// |Id |List              |
// +---+------------------+
// |1  |[Jack, 125, Text] |
// |2  |[Mary, 152, Text2]|
// +---+------------------+
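
Printing the schema of result confirms the new column is a string array (a sketch of the expected output; the exact nullable flags can vary by Spark version):

result.printSchema()
// root
//  |-- Id: integer (nullable = false)
//  |-- List: array (nullable = false)
//  |    |-- element: string (containsNull = true)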
Tzach Zohar
  • Thanks. This is the correct answer. But going forward with this, I ran into the next problem. It is not related specifically, so I created a new question. Check it out, maybe you can help me again: http://stackoverflow.com/questions/41245227/spark-dataframe-schema-definition-using-reflection-with-case-classes-and-column – V. Samma Dec 20 '16 at 15:01
  • nice answer, this helps me A LOT! – Claudio Cavalcante Apr 11 '17 at 19:57
  • I'm doing something like this but how to ignore null values while constructing the array? – marcia12 Jul 16 '21 at 19:55
  • @marcia12: I was looking for a similar solution. Found [link](https://stackoverflow.com/questions/50215195/remove-null-from-array-columns-in-dataframe-in-scala-with-spark-1-6) and used that as the basis to filter out nulls from the array: def nonNullArray = udf((arr: Seq[String]) => if (arr.size > 0) arr.filterNot(_ == null) else null) – B. Griffiths Sep 24 '21 at 17:10
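
Regarding the null question raised in the comments: on newer Spark versions the built-in higher-order filter function can drop null elements without a UDF (a minimal sketch, assuming Spark 3.0 or later):

import org.apache.spark.sql.functions._

// keep only the non-null elements of the array column
val cleaned = result.withColumn("List", filter($"List", _.isNotNull))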

The array function can also be used with withColumn:

import org.apache.spark.sql.{functions => F}

df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
ZettaP