I have a dataframe with following schema:

id         : int,
emp_details: Array(String)

Some sample data:

1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)

This data is in a dataframe, and I need to read emp_details from the array and assign the values to new columns as below, or alternatively split the array into multiple columns named empname, city and zip:

.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)

Could you please guide me on how to achieve this using Spark 1.6 with Scala?

Really appreciate your help...

Thanks a lot

koiralo
Bab

1 Answer

You can use `withColumn` with `split` to get the required data:

df1.withColumn("empname", split($"emp_details"(0), "=")(1))
  .withColumn("city", split($"emp_details"(1), "=")(1))
  .withColumn("zip", split($"emp_details"(2), "=")(1))

Output:

+---+----------------------------------+-------+----+-----+
|id |emp_details                       |empname|city|zip  |
+---+----------------------------------+-------+----+-----+
|1  |[empname=xxx, city=yyy, zip=12345]|xxx    |yyy |12345|
|2  |[empname=bbb, city=bbb, zip=22345]|bbb    |bbb |22345|
+---+----------------------------------+-------+----+-----+

UPDATE:
If the values in the array are not in a fixed order, you can use a UDF to convert the array to a map and look the fields up by key:

val getColumnsUDF = udf((details: Seq[String]) => {
  val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
  (detailsMap("empname"), detailsMap("city"), detailsMap("zip"))
})

Now use the UDF:

df1.withColumn("emp", getColumnsUDF($"emp_details"))
  .select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
  .show(false)

Output:

+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+

Hope this helps!

koiralo
  • Thank you for the quick response; I really appreciate the help. One more observation: the values in the array are not always in the same order. For example, one record may come as (city, zip, empname) and another as (empname, city, zip). If the positions change dynamically, how can we read them? Is there an option to read based on the name? – Bab May 16 '18 at 04:38
  • @Bab please check the updated answer. Now it won't matter how the data is ordered in the array. – koiralo May 16 '18 at 04:48
  • Hi, continuing the above question: sometimes a key-value pair may not be present in the column (array of strings). How can we handle that dynamically? For example, in some records zip may or may not be there. Do we have an option to handle such a situation? Thanks in advance. – Bab May 26 '18 at 01:32
  • That depends on your needs; you can return null or an empty string by checking the map. – koiralo May 26 '18 at 02:23
  • If you don't mind, can you give an example of returning null if the key is not present? – Bab May 26 '18 at 07:46
  • Thank you for the response; could you please provide an example of how to return null or empty if the key is not present in the map? – Bab May 28 '18 at 14:36
  • @Bab sorry for the late response, just change the last line of the udf to `(detailsMap.getOrElse("empname", null), detailsMap.getOrElse("city", null), detailsMap.getOrElse("zip", null))` and it should return null. – koiralo May 28 '18 at 16:31
  • @ShankarKoirala I have a similar situation; however, in my case a city or name could repeat several times, and only the last string value is kept. How could I get them as an array of elements? – carl Jun 03 '18 at 17:42
  • @carl In that case this should still work, I guess; please test it. While converting to a map, a later value overwrites the previous value for the same key. – koiralo Jun 03 '18 at 17:47
  • Checked it: since it's a Map, old values get replaced instead of appended because they have the same key. – carl Jun 03 '18 at 17:58
  • If you want all the values in a list, you can group by the first element and collect the multiple values into a list before converting to a map. – koiralo Jun 03 '18 at 18:01
  • @ShankarKoirala `details.map(_.split("=")).map(x => (x(0), x(1))).groupBy(_._1).map { case (k, v) => (k, v.map(_._2)) }.toMap` did the trick for me. Thanks for the quick feedback. – carl Jun 03 '18 at 18:41
  • @carl Great you did it. – koiralo Jun 04 '18 at 03:15
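
To make the missing-key handling discussed in the comments concrete, here is a minimal sketch of the UDF body factored out as a pure Scala function, so the lookup logic can be tested without a Spark session. The function name `extractDetails` is my own, not from the answer; it assumes each array element contains at least one `=`.

```scala
// Pure Scala version of the UDF body; missing keys fall back to null
// instead of throwing NoSuchElementException.
def extractDetails(details: Seq[String]): (String, String, String) = {
  val detailsMap = details
    .map(_.split("=", 2))                     // split on the first '=' only
    .collect { case Array(k, v) => (k, v) }   // skip malformed entries
    .toMap
  (detailsMap.getOrElse("empname", null),
   detailsMap.getOrElse("city", null),
   detailsMap.getOrElse("zip", null))
}

// In Spark, wrap it as before: val getColumnsUDF = udf(extractDetails _)
```

Splitting with a limit of 2 also keeps values that themselves contain `=` intact, which the plain `split("=")` in the answer would truncate.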
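
The duplicate-key variant from the last comments can be sketched the same way: a pure function (the name `extractAll` is my own) that groups repeated keys into a list instead of letting the last value win, which is what `carl`'s snippet does inside the UDF.

```scala
// Collect repeated keys into a list instead of overwriting earlier values.
// Pure Scala so the grouping logic can be checked without a Spark session.
def extractAll(details: Seq[String]): Map[String, Seq[String]] =
  details
    .map(_.split("=", 2))
    .collect { case Array(k, v) => (k, v) }
    .groupBy(_._1)                            // group the (key, value) pairs by key
    .map { case (k, vs) => (k, vs.map(_._2)) } // keep only the values
```

Note that a UDF returning `Map[String, Seq[String]]` maps to a Spark `MapType` column, so the per-key columns would then be pulled out with `$"emp"("city")` rather than by tuple position.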