1

I have a dataframe with this schema:

 root  
 |-- customer_id: string (nullable = true)  
 |-- service: struct (nullable = true)  
 |    |-- cat1: array (nullable = true)  
 |    |    |-- element: struct (containsNull = true)  
 |    |    |    |-- category: string (nullable = true)  
 |    |    |    |-- match_id: string (nullable = true)  
 |    |-- cat2: array (nullable = true)  
 |    |    |-- element: struct (containsNull = true)  
 |    |    |    |-- category: string (nullable = true)  
 |    |    |    |-- match_id: string (nullable = true)   

actual data looks like this:

+-----------+-------------------------------------------------------------------------------+  
|customer_id|service                                                                        |  
+-----------+-------------------------------------------------------------------------------+  
|CID1       |[[[cat1, service1], [cat1, service3]],]                                        |  
|CID2       |[[[cat1, service4],], [[cat2, service7], [cat2, service8], [cat2, service9]]]  |  
+-----------+-------------------------------------------------------------------------------+

I hope transformed data can look like this:

+-----------+------+--------------------------------------------------------------------------+  
|customer_id|  cat |  service                                                                 |  
+-----------+------+--------------------------------------------------------------------------+  
|CID1       | cat1 | [[cat1, service1], [cat1, service3]]                                     |  
|CID2       | cat1 | [[cat1, service4]]                                                       |
|CID2       | cat2 | [[cat2, service7], [cat2, service8], [cat2, service9]]                   |  
+-----------+------+--------------------------------------------------------------------------+

or even better(but it'll be simple if I can do above form)

+-----------+------+-----------------------------------+  
|customer_id|  cat |  service                          |  
+-----------+------+-----------------------------------+  
|CID1       | cat1 | [service1, service3]              |  
|CID2       | cat1 | [service4]                        |
|CID2       | cat2 | [service7, service8, service9]]   |  
+-----------+------+-----------------------------------+

where service is a concatenation of original cat1 and cat2.

And 1 thing to notice is there could be many fields under original service, meaning there could be cat1, cat2, cat3 ...

I'm new to Scala as well as Spark, and have searched for a while, but haven't seen similar examples.

JQ.
  • 678
  • 7
  • 17
  • sorry I changed the question a bit, because I realized the actual form I wanted was incorrect. Since there wasn't a correct answer yet, so I guess it's fine. – JQ. Mar 05 '20 at 21:02
  • To break up the `service` struct into its components, try `df.select("service.*")`. – Nick Chammas Mar 05 '20 at 21:23
  • @NickChammas no, df.select("service.*)) explode horizontally, what I need is explode vertically -- turn fields into rows instead of columns – JQ. Mar 05 '20 at 22:08
  • I found this question is similar to mine, and I actually solved my problem using partially the solution there. – JQ. Mar 05 '20 at 22:38
  • I found this question is similar to mine, and I actually solved my problem using partially the solution there: https://stackoverflow.com/questions/46566374/spark-scala-nested-structtype-conversion-to-map. The steps are: 1. turn the structype to Map, then explode the Map column, then I got my result. I'll mark my question as duplicate – JQ. Mar 05 '20 at 22:48
  • Sorry, careless reading on my part. Glad you figured it out. – Nick Chammas Mar 06 '20 at 00:25

1 Answers1

0

you could explode your service column twice and collect list by grouping customer_id:

val explodedOnceDF = df.select(col("customer_id"),explode("service").as("service"))
val explodedTwiceDF = explodedOnceDF.select(col("customer_id"),explode("service").as("service"))
val requiredOutput = explodedTwiceDF.groupBy("customer_id").agg(collect_list("service").as("service")).select(col("customer_id"),col("service"))

Hope this Helps!!

Suhas NM
  • 960
  • 7
  • 10
  • mate, it doesn't work. Because "service" column is of struct type, you can only explode array or map types – JQ. Mar 04 '20 at 05:44