Collect only not null columns of each row to an array

Question

The difficulty is that I'm trying to avoid UDFs as much as possible.

I have a dataset "wordsDS", which contains many null values:

+------+------+------+------+
|word_0|word_1|word_2|word_3|
+------+------+------+------+
|     a|     b|  null|     d|
|  null|     f|     m|  null|
|  null|  null|     d|  null|
+------+------+------+------+

I need to collect all of the columns of each row into an array. I don't know the number of columns in advance, so I'm using the columns() method.

groupedQueries = wordsDS.withColumn("collected",
      functions.array(Arrays.stream(wordsDS.columns())
               .map(functions::col).toArray(Column[]::new)));

But this approach produces empty elements:

+--------------------+
|           collected|
+--------------------+
|           [a, b,,d]|
|          [, f, m,,]|
|            [,, d,,]|
+--------------------+

Instead, I need the following result:

+--------------------+
|           collected|
+--------------------+
|           [a, b, d]|
|              [f, m]|
|                 [d]|
+--------------------+

So basically, I need to collect all of the columns of each row into an array, with the following requirements:

The resulting array must not contain empty elements.
The number of columns is not known upfront.

I've also thought of filtering the dataset's "collected" column for empty values, but I can't come up with anything other than a UDF. I'm trying to avoid UDFs so as not to kill performance. If anyone could suggest a way to filter the "collected" column for empty values with as little overhead as possible, that would be really helpful.

If you want Spark SQL built-in functions, browse this link [StackOverflowQuestionLink](https://stackoverflow.com/questions/54159964/how-to-remove-nulls-with-array-remove-spark-sql-built-in-function). Try 'array_except' — Key.L, Aug 11 '21 at 08:36

score 3 · Answer 1 · answered Nov 07 '19 at 20:10

You can use array("*") to collect all the columns into one array, then use array_except (requires Spark 2.4+) to filter out the nulls:

df
  .select(array_except(array("*"),array(lit(null))).as("collected"))
  .show()

gives

+---------+
|collected|
+---------+
|[a, b, d]|
|   [f, m]|
|      [d]|
+---------+
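
One caveat with array_except is that it also drops duplicate values, so a row like (a, a, null, d) would come out as [a, d]. If duplicates need to survive, the higher-order SQL function filter (also Spark 2.4+) is another UDF-free option. Below is a minimal PySpark sketch, assuming all columns share one type (strings here). The helper names non_null_array_expr and with_collected are hypothetical, not part of Spark.

```python
def non_null_array_expr(columns):
    """Build a Spark SQL expression that packs the given columns into an
    array and keeps only the non-null elements (duplicates are preserved)."""
    # Backtick-quote each name so unusual column names still parse;
    # a literal backtick inside a name is escaped by doubling it.
    quoted = ", ".join("`{}`".format(c.replace("`", "``")) for c in columns)
    # filter(array, lambda) is a built-in higher-order function in Spark 2.4+.
    return "filter(array({}), x -> x IS NOT NULL)".format(quoted)


def with_collected(df, name="collected"):
    """Add the filtered array as a new column (hypothetical helper)."""
    # Imported here so the expression builder above works without Spark installed.
    from pyspark.sql import functions as F
    return df.withColumn(name, F.expr(non_null_array_expr(df.columns)))


print(non_null_array_expr(["word_0", "word_1"]))
# prints: filter(array(`word_0`, `word_1`), x -> x IS NOT NULL)
```

Calling with_collected(df) on a PySpark DataFrame with the question's schema should then add the collected column. The same expression string can also be passed to functions.expr from Java or Scala.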

score 0 · Answer 2 · answered Nov 08 '19 at 06:01

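On Spark < 2.0 you can define a UDF to remove the nulls. Note that this example stores the literal string "null" rather than real null values: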

scala> val df = Seq(("a",  "b",  "null",  "d"),("null",  "f",  "m",  "null"),("null",  "null",  "d",  "null")).toDF("word_0","word_1","word_2","word_3")

scala> // Drop real nulls as well as the literal string "null" used in this example
scala> val arrayNullFilter = udf((arr: Seq[String]) => arr.filter(x => x != null && x != "null"))

scala> df.select(array('*).as('all)).withColumn("test",arrayNullFilter(col("all"))).show
+--------------------+---------+
|                 all|     test|
+--------------------+---------+
|     [a, b, null, d]|[a, b, d]|
|  [null, f, m, null]|   [f, m]|
|[null, null, d, n...|      [d]|
+--------------------+---------+

hope this helps you.

score 0 · Answer 3 · edited Oct 31 '21 at 01:25

display(df_part_groups.withColumn("combined", F.array_except(F.array("*"), F.array(F.lit("null"))) ))

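This statement doesn't remove the nulls, because F.lit("null") is the string "null" rather than a SQL NULL. It only collapses the nulls into a single distinct null.

If the missing values are empty strings rather than SQL NULLs, use this instead: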

display(df_part_groups.withColumn("combined", F.array_except(F.array("*"), F.array(F.lit(""))) ))
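
The two statements above differ only in which string literal is removed, and neither literal is a SQL NULL. When a dataset mixes real nulls with placeholder strings such as "" or "null", a single predicate can cover all of them. This is a minimal sketch, assuming string columns and Spark 2.4+. The function blank_free_array_expr and its placeholders parameter are hypothetical names.

```python
def blank_free_array_expr(columns, placeholders=("", "null")):
    """Build a Spark SQL expression that collects the columns into an array
    and drops SQL NULLs plus any of the given placeholder strings."""
    for p in placeholders:
        # Assumption: placeholders are simple strings; quoting is not handled.
        if "'" in p or "\\" in p:
            raise ValueError("placeholder must not contain quotes or backslashes")
    # Backtick-quote column names, doubling any embedded backtick.
    cols = ", ".join("`{}`".format(c.replace("`", "``")) for c in columns)
    predicate = "x IS NOT NULL"
    if placeholders:
        literals = ", ".join("'{}'".format(p) for p in placeholders)
        predicate += " AND x NOT IN ({})".format(literals)
    # filter(array, lambda) is a Spark SQL higher-order function (2.4+).
    return "filter(array({}), x -> {})".format(cols, predicate)


print(blank_free_array_expr(["word_0", "word_1"]))
# prints: filter(array(`word_0`, `word_1`), x -> x IS NOT NULL AND x NOT IN ('', 'null'))
```

With PySpark available, the expression could be applied as df_part_groups.withColumn("combined", F.expr(blank_free_array_expr(df_part_groups.columns))). Unlike array_except, this keeps repeated values in the result.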

edited Oct 31 '21 at 01:25 by Vladimir Vlasov · answered Oct 29 '21 at 19:08 by NNM
