
I am currently working with a DataFrame in Scala and can't figure out how to fill a column with a Seq.empty[Row] value when the value in that row is null. I understand there is the df.na.fill function; however, from my understanding it only handles strings, doubles, or ints.

Essentially, I want to be able to fill the null value with a Seq.empty[Row] for a nested field. I'm sure this is possible, but I can't find anything like it online.
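For reference, a minimal sketch of the kinds of replacements df.na.fill does accept (the column names here are hypothetical):

df.na.fill("missing", Seq("field"))       // fills nulls in a string column
df.na.fill(0.0, Seq("someDoubleCol"))     // fills nulls in a double column
// there is no overload that accepts an array or Seq[Row] replacement value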

Table example --

+------+-------+-----------------+
| field| field2|           values|
+------+-------+-----------------+
|stuff2|stuff 2|             null|
|stuff1| stuff1| [val 1,val2,..] |
+------+-------+-----------------+

Essentially, I want that null to turn into an empty array like so: []. Any help would be appreciated, thank you.

Daniel Dao

1 Answer


You can do it using a udf and coalesce, as shown below.

val df = Seq(("stuff2","stuff2",null), ("stuff2","stuff2",Array("value1","value2")),("stuff3","stuff3",Array("value3"))).toDF("field","field2","values")
df.show()

import org.apache.spark.sql.functions.udf
val array_ = udf(() => Array.empty[String])

val df2 = df.withColumn("values", coalesce(df("values"), array_()))
df2.show()
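With the sample rows above, df2.show() should print something like this (exact column widths may differ):

+------+------+----------------+
| field|field2|          values|
+------+------+----------------+
|stuff2|stuff2|              []|
|stuff2|stuff2|[value1, value2]|
|stuff3|stuff3|        [value3]|
+------+------+----------------+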
abaghel
  • There's a lot of difficulty in converting a `null` to an empty `Seq` of `Row` perfectly, because `coalesce` depends on the type of the column being correct. I'm having trouble inferring the type of the column effectively, and I think that's a different question. However your point still stands, so thank you for the answer. – Daniel Dao Dec 22 '16 at 15:24
  • Clever hack using `udf` to inject non-literal value! – metasim Apr 24 '17 at 12:26
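Following up on the first comment: one way to keep coalesce type-correct for nested columns is to read the column's actual DataType from the schema and cast an empty array() literal to it. This is only a sketch of that idea, not part of the accepted answer; it assumes Spark 3.x, where an empty array() literal has element type null and can therefore be cast to any array type, including array<struct<...>>, and df3 is just a hypothetical name.

import org.apache.spark.sql.functions.{array, coalesce, col}

// Look up the actual type of the "values" column from the schema, then cast
// an empty array() literal to that type so both arguments of coalesce agree,
// even when the column is a nested array<struct<...>>.
val valuesType = df.schema("values").dataType
val df3 = df.withColumn("values", coalesce(col("values"), array().cast(valuesType)))
df3.show()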