
I have a DataFrame with many (score, count) column pairs. This situation is not a pivot, but something closer to an unpivot. Example:


| house_score | house_count | mobile_score | mobile_count | sport_score | sport_count | ...&lt;other column pairs&gt;... |
|-------------|-------------|--------------|--------------|-------------|-------------|-----------------------------|
| 20          | 2           | 48           | 6            | 6           | 78          |                             |
| 40          | 78          | 47           | 74           | 69          | 6           |                             |

I want a new DataFrame with only two columns, score and count. The new DataFrame should collapse all of the column pairs into a single (score, count) pair:

| score | count |
|-------|-------|
| 20    | 2     |
| 40    | 78    |
| 48    | 6     |
| 47    | 74    |
| 6     | 78    |
| 69    | 6     |

What's the best solution (elegant code/performance)?

Arturo Gatto
  • [This](http://stackoverflow.com/questions/30244910/pivot-spark-dataframe) might contain the solution to your question. Your question might be a duplicate – Quickbeam2k1 Mar 30 '17 at 09:26
  • Possible duplicate of [Pivot Spark Dataframe](http://stackoverflow.com/questions/30244910/pivot-spark-dataframe) – Quickbeam2k1 Mar 30 '17 at 09:27

2 Answers


You can achieve this using a foldLeft over the distinct column-name prefixes (the part before the _). This is reasonably efficient since all the intensive operations are distributed, and the code is fairly clean and concise.

// df from example (spark-shell; col comes from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions.col

val df = sc.parallelize(List((20, 2, 48, 6, 6, 78), (40, 78, 47, 74, 69, 6)))
  .toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")

// grab the distinct column-name prefixes (the part before the _)
val cols = df.columns.map(_.split("_")(0)).distinct

// fold left over all prefixes, unioning the two-column projection of each pair
val result = cols.tail.foldLeft(
   // initialise with the projection for cols.head
   df.select(col(s"${cols.head}_score").as("score"), col(s"${cols.head}_count").as("count"))
){ case (acc, c) =>
   // union the projection for prefix c
   acc.unionAll(df.select(col(s"${c}_score").as("score"), col(s"${c}_count").as("count")))
}

result.show
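
Note: in Spark 2.x, unionAll is deprecated in favour of union. A minimal sketch of the same fold using union (assuming Spark 2.x; result2 is just an illustrative name):

// Spark 2.x: same fold, with union instead of the deprecated unionAll
val result2 = cols.tail.foldLeft(
   df.select(col(s"${cols.head}_score").as("score"), col(s"${cols.head}_count").as("count"))
){ case (acc, c) =>
   acc.union(df.select(col(s"${c}_score").as("score"), col(s"${c}_count").as("count")))
}

result2.show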
Ben Horsburgh

Using unionAll as suggested in the other answer requires scanning the data multiple times, projecting the DataFrame down to just two columns on each scan. From a performance perspective, multiple scans should be avoided if the work can be done in a single pass, especially when the dataset is large, cannot be cached, or would otherwise need to be scanned many times.

You can do it in one pass by generating all the (score, count) tuples from each row and flat-mapping over them. I'll let you decide how elegant it is:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val df = List((20,2,48,6,6,78), (40,78,47,74,69,6))
    .toDF("house_score", "house_count", "mobile_score", "mobile_count", "sport_score", "sport_count")

df.show

val result = df
    // columns 0, 2, 4 hold the scores; the matching counts sit at i + 1
    .flatMap(r => Range(0, 5, 2).map(i => (r.getInt(i), r.getInt(i + 1))))
    .toDF("score", "count")

result.show


// Exiting paste mode, now interpreting.

+-----------+-----------+------------+------------+-----------+-----------+
|house_score|house_count|mobile_score|mobile_count|sport_score|sport_count|
+-----------+-----------+------------+------------+-----------+-----------+
|         20|          2|          48|           6|          6|         78|
|         40|         78|          47|          74|         69|          6|
+-----------+-----------+------------+------------+-----------+-----------+

+-----+-----+
|score|count|
+-----+-----+
|   20|    2|
|   48|    6|
|    6|   78|
|   40|   78|
|   47|   74|
|   69|    6|
+-----+-----+

df: org.apache.spark.sql.DataFrame = [house_score: int, house_count: int ... 4 more fields]
result: org.apache.spark.sql.DataFrame = [score: int, count: int]
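
If the number of column pairs is not fixed, the indices can be derived from the schema instead of hard-coding Range(0, 5, 2). A minimal sketch, assuming every _score column is immediately followed by its matching _count column (scoreIdx and resultGeneric are illustrative names):

// derive the positions of the *_score columns from the schema
// assumption: each *_score column is immediately followed by its *_count column
val scoreIdx = df.columns.indices.filter(i => df.columns(i).endsWith("_score"))

val resultGeneric = df
    .flatMap(r => scoreIdx.map(i => (r.getInt(i), r.getInt(i + 1))))
    .toDF("score", "count")

resultGeneric.show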
Traian