
I would like to transpose a DataFrame, aggregating the values per column. Let me illustrate it with an example:

Given this DataFrame:

val df = sc.parallelize(Seq(("A","B","C"), ("D", "E", "F"), ("X", "Y", "Z"), ("A", "N", "Z"))).toDF("col1", "col2", "col3")
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   B|   C|
|   D|   E|   F|
|   X|   Y|   Z|
|   A|   N|   Z|
+----+----+----+

The expected output should be something like this:

col1: Array("A", "D", "X")
col2: Array("B", "E", "Y", "N")
col3: Array("C", "F", "Z")

Note that the real DataFrame could contain hundreds of columns. It is not necessary to preserve the order of the columns in the output.

Edit: Note as well that the columns may contain repeated elements, but I only want the unique elements.

I am using Spark 2.0.2 with Scala 2.11.

Any suggestions?

Thanks in advance!

  • This [SO link](https://stackoverflow.com/questions/40892459/spark-transpose-dataframe-without-aggregating) might be of interest to you. – Leo C Nov 14 '17 at 08:16

1 Answer


You can apply groupBy and collect_set, but I doubt the performance of that if your row count is on the higher side.

df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   B|   C|
|   D|   E|   F|
|   X|   Y|   Z|
|   A|   N|   Z|
+----+----+----+

import org.apache.spark.sql.functions._

// Build one collect_set aggregation per column, aliased back to the column name.
val query = df.columns.map(x => collect_set(col(x)).as(x))

// Group all rows into a single group, apply every aggregation at once,
// then select the columns back under their original names.
df.groupBy(lit(1))
  .agg(query.head, query.tail: _*)
  .select(df.columns.map(col(_)): _*)
  .show
+---------+------------+---------+
|     col1|        col2|     col3|
+---------+------------+---------+
|[A, D, X]|[B, E, Y, N]|[C, F, Z]|
+---------+------------+---------+
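
If you prefer to skip the dummy grouping key, a minimal alternative sketch (assuming the aggregated sets fit on the driver; aggregated, row and result are just illustrative names) is to apply collect_set directly in a select, which also yields a single row, and then pull that row back as a map of column name to unique values:

import org.apache.spark.sql.functions.{col, collect_set}

// Apply collect_set to every column in one global aggregation (no groupBy needed),
// keeping each column's original name.
val aggregated = df.select(df.columns.map(c => collect_set(col(c)).as(c)): _*)

// The result is a single row; bring it to the driver and turn it into a map
// of column name -> unique values found in that column.
val row = aggregated.head
val result: Map[String, Seq[String]] =
  df.columns.map(c => c -> row.getAs[Seq[String]](c)).toMap

result.foreach { case (name, values) =>
  println(s"$name: ${values.mkString("Array(", ", ", ")")}")
}

Either way, all distinct values still end up funnelled through a single task, so the caveat about row count applies here too; for very large data the linked question about transposing without aggregating may be the better route.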
– philantrovert