
I have data and an issue similar to the question asked here: Spark sql how to explode without losing null values

I have used the solution proposed for Spark <= 2.1, and the null values do indeed appear as literals in my data after the split:

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))

The issue is that afterwards I need to check whether there are null values in that column and take an action in that case. When I run my code, the nulls inserted as literals are recognized as strings instead of null values.

So the code below always returns 0, even when the row has a null in that column:

df.withColumn("origin", f.when(f.col("likes").isNotNull(), 0).otherwise(2)).show()

+--------+------+
|likes   |origin|
+--------+------+
|    CARS|     0|
|    CARS|     0|
|    null|     0|
|    null|     0|
+--------+------+
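The distinction can be sketched in plain Python, without Spark (the values below are illustrative, not my actual data): a real None behaves like SQL NULL, but the literal string "null" does not, so a not-null check still maps it to 0.

```python
# Plain-Python sketch (no Spark; values are illustrative): only a real
# None counts as "missing"; the string "null" passes a not-null check.
rows = ["CARS", "CARS", "null", None]

def flag(value):
    # mirrors f.when(f.col("likes").isNotNull(), 0).otherwise(2)
    return 0 if value is not None else 2

print([flag(v) for v in rows])  # [0, 0, 0, 2]
```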

I am using PySpark on Cloudera.

DroppingOff

2 Answers


You could hack around this by using a UDF:

val empty = udf(() => null: String)

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(empty()))))
  • Hi, thanks. I just copy-paste your function and I get an error: File "", line 1:undefined val empty = udf(() => null: String) ^ SyntaxError: invalid syntax. Perhaps it does not work with all the versions? I am using pyspark – DroppingOff Oct 16 '18 at 12:25
  • Hi, I don't know why udf does not work for me but I found another way and answer here. I will mark your answer as good anyway so that it can help others. Thanks – DroppingOff Oct 16 '18 at 13:20

I actually found a way. In PySpark there is no null keyword, so in the otherwise you have to write lit(None) instead:

.otherwise(array(lit(None).cast("string")))
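As a sanity check, the explode-without-losing-nulls logic can be sketched in plain Python (the function name and data are illustrative, not Spark API):

```python
# Plain-Python sketch of explode-without-losing-nulls (illustrative,
# not Spark API): a None row is substituted with a one-element [None]
# list, so it survives the explode as a real null instead of vanishing.
def explode_keep_nulls(rows):
    out = []
    for likes in rows:
        # mirrors .otherwise(array(lit(None).cast("string")))
        for item in (likes if likes is not None else [None]):
            out.append(item)
    return out

print(explode_keep_nulls([["CARS", "CARS"], None]))  # ['CARS', 'CARS', None]
```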
