1

Let's take the below dataframe as an example

+--------+-----------------------------
|id  | fee_amount   | discount_amount | 
|1   | 10.00        | 5.0             |
|2   | 20.0         | 3.0             |

I want to be able to convert the above dataframe to following

+--------+-----------------------------
|id  | amount_type  | discount_amount |
|1   | fee          |   10.0          |
|1   | discount     |   5.0           |
|2   | fee          |   20.0          |
|2   | discount     |   3.0           |

I just double the number of rows and I'm ok with that.

I only want one column where amount value is stored and another column where a type of amount is stored. In the above example, I am given names of columns i.e. fee_amount, discount_amount that needs to be transposed. Is this even possible to do in spark dataframe?

koiralo
  • 22,594
  • 6
  • 51
  • 72
Nick01
  • 349
  • 2
  • 8
  • 22

1 Answers1

4

One of the solutions is to create an array with column fee_amount and discount_amount and explode (which will add a row)

import org.apache.spark.sql.functions._
val df = Seq(
  (1, 10.00, 5.0),
  (2, 20.00, 3.0)
).toDF("id", "fee_amount", "discount_amount")

val result = df.select($"id", posexplode(array($"fee_amount", $"discount_amount")))

//Now replace the exploded value 0 with fee and 1 with discount
result.withColumn("amount_type", when($"pos" === 0, "fee").otherwise("discount"))
  .drop("pos")
  .withColumnRenamed("col", "discount_amount")
  .show()

Output:

+---+---------------+-----------+
|id |discount_amount|amount_type|
+---+---------------+-----------+
|1  |10.0           |fee        |
|1  |5.0            |discount   |
|2  |20.0           |fee        |
|2  |3.0            |discount   |
+---+---------------+-----------+

Hope this helps!

koiralo
  • 22,594
  • 6
  • 51
  • 72
  • How would this change if I have 10 other columns like id? – Nick01 Mar 24 '18 at 22:11
  • You can select the other column as well or join the original dataframe to get all other column. – koiralo Mar 25 '18 at 14:00
  • can you give an example? I tried creating an array of columns and pass it to select method but it's complaining about expecting string/column as first argument. – Nick01 Mar 26 '18 at 03:35
  • yeah you can't pass an array you need to select as `($"id", $"col1", $"col2"...... , posexplode())` – koiralo Mar 26 '18 at 04:34
  • oh that's a bummer. I wanted this piece of code to be generic enough. didn't wanna hardcode those columns. – Nick01 Mar 26 '18 at 05:18
  • Another way you can do is select the columns from original df and join it with `result` dataframe – koiralo Mar 26 '18 at 05:19