
I am new to Spark programming. I am trying to explode a DataFrame column that contains an empty array. I thought the explode function, in simple terms, creates an additional row for every element in the array. But the result is different.

I am not able to understand the logic behind the exploded DataFrame. Could someone please explain the following example? I want to understand the underlying principle/cause of this result. Why is an empty array treated the same as null in a DataFrame?

// input DataFrame
+---+------+----------+
|age|  name|occupation|
+---+------+----------+
| []|Harish| developer|
+---+------+----------+
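For reference, a minimal Scala sketch that reproduces the input above (the construction code is an assumption; only the displayed DataFrame appears in the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().appName("explode-demo").master("local[*]").getOrCreate()
import spark.implicits._

// one row whose "age" column is an empty integer array
val df = Seq((Seq.empty[Int], "Harish", "developer")).toDF("age", "name", "occupation")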

df.withColumn("age",explode(col("age")))

//DataFrame with age column exploded
+---+----+----------+
|age|name|occupation|
+---+----+----------+
+---+----+----------+

// expected DataFrame
+---+------+----------+     +----+------+----------+
|age|  name|occupation|     |age |  name|occupation|
+---+------+----------+ (or)+----+------+----------+
|   |Harish| developer|     |null|Harish| developer|
+---+------+----------+     +----+------+----------+

EDIT1: As per Chandan's comment, I found the Stack Overflow question Spark sql how to explode without losing null values and could understand the explode API available in Spark 2. But I could not find a proper explanation for why the row was deleted.

Harish Gontu
  • I have already read the question mentioned by Shaido; I have also written that in EDIT1. But since it couldn't help me resolve my doubt, I raised this question: why are a null object and an empty array considered the same? – Harish Gontu Sep 20 '18 at 05:45
  • Though the question seems to be a duplicate, I found a better explanation in the source code link attached by Chandan. Every answer told me that null objects are ignored, but never mentioned why, so I asked this question. Sorry for wasting your time and thanks for the help – Harish Gontu Sep 22 '18 at 06:46

1 Answer


That is the behaviour of the explode API: rows whose array is empty (or null) are dropped. If you want the desired output, use explode_outer:

df.withColumn("age",explode_outer(col("age")))
Chandan Ray
  • Thank you Chandan. But I want to know the root cause of this behaviour / the mechanism by which explode breaks an array apart into additional rows, and why my row got deleted. – Harish Gontu Sep 19 '18 at 18:28
  • It's not an issue; the explode function is the same as flatMap for a Dataset (see the plain-Scala sketch after these comments). explode_outer generates the same output, with the only difference that if the array or map is null or empty it won't drop the row, and will generate null for that column with the other columns in place. Please check the source code https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala – Chandan Ray Sep 19 '18 at 19:15
  • Thanks Chandan, your answer helped a lot – Harish Gontu Sep 20 '18 at 02:48
  • @HarishGontu please accept the answer if you find it useful – Chandan Ray Sep 20 '18 at 06:39
  • @HarishGontu Thanks – Chandan Ray Sep 22 '18 at 08:32
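To make the flatMap analogy from the comments concrete, here is a plain-Scala sketch (no Spark needed; the second row is made up for contrast) showing why an empty inner collection produces zero output rows:

// each (name, ages) pair contributes one output tuple per element of ages;
// an empty ages collection therefore contributes nothing at all
val rows = Seq(("Harish", Seq.empty[Int]), ("Asha", Seq(25, 26)))
val exploded = rows.flatMap { case (name, ages) => ages.map(age => (name, age)) }
println(exploded)
// List((Asha,25), (Asha,26)) -- the "Harish" row has vanished, just like with explode()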