
I saw this question here: Transpose DataFrame Without Aggregation in Spark with scala, and I wanted to do exactly the opposite.

I have this DataFrame with a single row, whose values can be string, int, bool, or array:

+-----+-------+-----+------+-----+
|col1 | col2  |col3 | col4 |col5 |
+-----+-------+-----+------+-----+
|val1 | val2  |val3 | val4 |val5 |
+-----+-------+-----+------+-----+

And I want to transpose it like this:

+-----------+-------+
|Columns    | values|
+-----------+-------+
|col1       | val1  |
|col2       | val2  |
|col3       | val3  |
|col4       | val4  |
|col5       | val5  |
+-----------+-------+

I am using Apache Spark 2.4.3 with Scala 2.11.

Edit: Values can be of any type (int, double, bool, array), not only strings.

Marcelo Li Koga

2 Answers


From Spark 2.4, you can use arrays_zip to pair an array of the column values with an array of the column names, then explode the zipped array to get the result.

Example:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("val1", "val2", "val3", "val4", "val5")).toDF("col1", "col2", "col3", "col4", "col5")

// one Column reference per column of the DataFrame
val cols = df.columns.map(c => col(c))

// the column names joined into one string, so split() can turn them back into an array column
val str_cols = df.columns.mkString(",")

df.withColumn("new", explode(arrays_zip(array(cols: _*), split(lit(str_cols), ","))))
  .select("new.*")
  .toDF("values", "Columns")
  .show()
//+------+-------+
//|values|Columns|
//+------+-------+
//|  val1|   col1|
//|  val2|   col2|
//|  val3|   col3|
//|  val4|   col4|
//|  val5|   col5|
//+------+-------+

UPDATE:

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")

df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)

// cast every column to string so array() has a single element type
val cols = df.columns.map(c => col(c).cast("string").alias(c))

val str_cols = df.columns.mkString(",")

df.withColumn("new", explode(arrays_zip(array(cols: _*), split(lit(str_cols), ","))))
  .select("new.*")
  .toDF("values", "Columns")
  .show()

//+------+-------+
//|values|Columns|
//+------+-------+
//|     2|   col1|
//|     3|   col2|
//|  true|   col3|
//|   2.4|   col4|
//|   val|   col5|
//+------+-------+
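
Note that arrays_zip pairs the two arrays positionally, so their order controls the struct field order. To get Columns first (the layout asked for in the question), swap the two arguments, as the OP also confirms in the comments below:

// same cols and str_cols as above; names first, then values
df.withColumn("new", explode(arrays_zip(split(lit(str_cols), ","), array(cols: _*))))
  .select("new.*")
  .toDF("Columns", "values")
  .show()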
notNull
  • Nice. What if the OP is using a lower version of Spark? – Ram Ghadiyaram Apr 27 '20 at 23:25
  • @RamGhadiyaram, thanks! Then we would need to use a `udf` in Scala, taking this post as a reference: https://stackoverflow.com/a/41027619 – notNull Apr 27 '20 at 23:55
  • Check my answer without udf – Ram Ghadiyaram Apr 28 '20 at 00:28
  • @Shu, you can use the SparkSQL function stack (sketched below), check: https://stackoverflow.com/questions/42465568/unpivot-in-spark-sql-pyspark. – jxc Apr 28 '20 at 11:02
  • Hmmm I forgot to specify the types of the values.. Actually they are mixed (not only string, they can be int, float, bool, etc) and I got this Exception: `org.apache.spark.sql.AnalysisException: cannot resolve 'map[...]' due to data type mismatch: The given values of function map should all be the same type` – Marcelo Li Koga Apr 28 '20 at 14:25
  • @MarceloLiKoga, please check my `updated` answer: we can cast to **`string`** and then explode to get the desired result for any `datatype`. – notNull Apr 28 '20 at 15:37
  • yes, it is working now! Just adjusted to `df.withColumn("new",explode(arrays_zip(split(lit(str_cols),","), array(cols:_*)))).` for the correct order of columns – Marcelo Li Koga Apr 28 '20 at 23:10
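
For Spark versions below 2.4, where arrays_zip is not available (the case raised in the comments), here is a minimal sketch of the stack unpivot approach jxc links above. The df1 name is just for illustration; the columns are cast to string first so all stacked values share one type:

// cast every column to string so the stacked values have a single type
val df1 = df.select(df.columns.map(c => col(c).cast("string").alias(c)): _*)

// builds: stack(5, 'col1', `col1`, ..., 'col5', `col5`) as (Columns, values)
val stackExpr = df1.columns
  .map(c => s"'$c', `$c`")
  .mkString(s"stack(${df1.columns.length}, ", ", ", ") as (Columns, values)")

df1.selectExpr(stackExpr).show(false)

stack is available from Spark 2.0, so this avoids the version constraint entirely.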

Thought differently, without using arrays_zip (which is only available in Spark >= 2.4), and got the below...

It will work for Spark 2.0 onwards in a simpler way (using the flatMap, map and explode functions)...

Here the map function (used inside withColumn) creates a new map column; the input columns must be grouped as key-value pairs.

Case: String data type in the data:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("val1", "val2", "val3", "val4", "val5")).toDF("col1", "col2", "col3", "col4", "col5")

// interleave each column name (as a literal) with its value column: key1, value1, key2, value2, ...
val columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()

df.withColumn("myMap", map(columnsAndValues: _*))
  .select(explode($"myMap"))
  .toDF("Columns", "Values")
  .show(false)

Result:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)
 |-- col4: string (nullable = true)
 |-- col5: string (nullable = true)

+-------+------+
|Columns|Values|
+-------+------+
|col1   |val1  |
|col2   |val2  |
|col3   |val3  |
|col4   |val4  |
|col5   |val5  |
+-------+------+

Case: Mix of data types in the data:

If you have different types, convert them to String first... the remaining steps won't change:

val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)

Full example:

import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StringType

val df = Seq((2, 3, true, 2.4, "val")).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()

// convert all columns to string type, since map() needs a single value type
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()

// interleave column-name literals with value columns: key1, value1, key2, value2, ...
val columnsAndValues: Array[Column] = df1.columns.flatMap { c => Array(lit(c), col(c)) }

df1.withColumn("myMap", map(columnsAndValues: _*))
   .select(explode($"myMap"))
   .toDF("Columns", "Values")
   .show(false)

Result:

root
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: boolean (nullable = false)
 |-- col4: double (nullable = false)
 |-- col5: string (nullable = true)

root
 |-- col1: string (nullable = false)
 |-- col2: string (nullable = false)
 |-- col3: string (nullable = false)
 |-- col4: string (nullable = false)
 |-- col5: string (nullable = true)

+-------+------+
|Columns|Values|
+-------+------+
|col1   |2     |
|col2   |3     |
|col3   |true  |
|col4   |2.4   |
|col5   |val   |
+-------+------+
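
If you need this in more than one place, the whole recipe can be wrapped in a small helper. This is only a sketch of the map/explode approach above; transposeSingleRow is a hypothetical name, not a Spark API:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// hypothetical helper: cast + map + explode, as in the full example above
def transposeSingleRow(df: DataFrame): DataFrame = {
  // cast everything to string so map() sees a single value type
  val asString = df.select(df.columns.map(c => col(c).cast("string").alias(c)): _*)
  // interleave name literals with value columns: key1, value1, key2, value2, ...
  val kvPairs: Array[Column] = asString.columns.flatMap(c => Array(lit(c), col(c)))
  asString
    .withColumn("myMap", map(kvPairs: _*))
    .select(explode(col("myMap")))
    .toDF("Columns", "Values")
}

transposeSingleRow(df).show(false)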
Ram Ghadiyaram
  • Do you know how this can be done in python i.e pyspark? – Nikunj Kakadiya Mar 11 '21 at 09:16