1

I have an column containing an array of phone numbers represented as structs, and need to put them into three columns by a "type" attribute (phone1, phone2, fax).

Here are two sample values of the column.

[{"number":"000-000-0000","type":"Phone1"},{"number":"000-000-0001","type":"Phone2"},{"number":"000-000-0002","type":"Fax"}]
[{"number":"000-000-1000","type":"Phone1"},{"number":"000-000-1001","typeCode":"Fax"},{"number":"000-000-1002","type":"Phone2"}]

I want to split each into three columns, one for each type. I want something like this:

   Phone1           Phone2           Fax
000-000-0000     000-000-0001     000-000-0002
000-000-1000     000-000-1002     000-000-1001

This answer shows how to put each element of the array into its own column. How to explode an array into multiple columns in Spark

This gets me halfway there, but I can't rely on the order of items in the array. If I do this I'll get something like this, where the Phone2 and Fax values in the second column are out of place.

   Phone1           Phone2           Fax
000-000-0000     000-000-0001     000-000-0002
000-000-1000     000-000-1001     000-000-1002

How can I split the single column value into three columns, using the type value? An array can have 0-3 numbers, but will never have more than one number of each type.

SCouto
  • 7,808
  • 5
  • 32
  • 49
nstudenski
  • 186
  • 1
  • 9

1 Answers1

1

Here's one way which involves flattening the phone/fax#s via explode, followed by pivoting on the typeCode, as shown in the following example:

case class Contact(number: String, typeCode: String)

val df = Seq(
  (1, Seq(Contact("111-22-3333", "Phone1"), Contact("111-44-5555", "Phone2"), Contact("111-66-7070", "Fax"))),
  (2, Seq(Contact("222-33-4444", "Phone1"), Contact("222-55-6060", "Fax"), Contact("111-77-8888", "Phone2")))
).toDF("user_id", "contacts")

df.
  withColumn("contact", explode($"contacts")).
  groupBy($"user_id").pivot($"contact.typeCode").agg(first($"contact.number")).
  show(false)
// +-------+-----------+-----------+-----------+
// |user_id|Fax        |Phone1     |Phone2     |
// +-------+-----------+-----------+-----------+
// |1      |111-66-7070|111-22-3333|111-44-5555|
// |2      |222-55-6060|222-33-4444|111-77-8888|
// +-------+-----------+-----------+-----------+
Leo C
  • 22,006
  • 3
  • 26
  • 39
  • What if the DF contains other columns, which you want to keep? One can add them into the `groupBy` clause, but that doesn't work with unorderable types like a MapType.. The only solution I can think of involves joining those columns back to the pivoted DF.. all fairly expensive operations.. – Melkor.cz Oct 21 '19 at 10:20