1

Hi I want to add a new column to a dafaframe which contains the list of all column names(for that row) which are not null. How do I achieve this in Scala. Please help.

val baseDF = Seq(
(3, "California", "name1", 9846, null, "SFO"),
(1, "Oregon", "name2", 9847, null, null),
(2, null, null, null, null, null)
).toDF("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

Expected output is new column named "NonNullColumns" with expected non null column names for each row:

NonNullColumns 
==============
["emp_id", "emp_city", "emp_name", "emp_phone", "emp_site"]
["emp_id", "emp_city", "emp_name", "emp_phone"]
["emp_id"]
thebluephantom
  • 16,458
  • 8
  • 40
  • 83
ark
  • 35
  • 7

2 Answers2

3

I've loaded data from csv, all fields as strings.

val cols = baseDF.schema.fieldNames.map(s=>when(col(s).isNotNull, s).otherwise(""))
df.select(cols:_*).select(array_remove(array('*),"").as("NonNullColumns")).show(false)

output:

+------+----------+--------+---------+-------+--------+
|emp_id|  emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+----------+--------+---------+-------+--------+
|     3|California|   name1|     9846|   null|     SFO|
|     1|    Oregon|   name2|     9847|   null|    null|
|     2|      null|    null|     null|   null|    null|
+------+----------+--------+---------+-------+--------+

+-------------------------------------------------+
|NonNullColumns                                   |
+-------------------------------------------------+
|[emp_id, emp_city, emp_name, emp_phone, emp_site]|
|[emp_id, emp_city, emp_name, emp_phone]          |
|[emp_id]                                         |
+-------------------------------------------------+
chlebek
  • 2,431
  • 1
  • 8
  • 20
2

Slight alternative using withColumn and reduce and using your DF I made them all String so as to avoid Any type issues, df used as name, and only relevant parts of code shown:

val nonNulls = df.columns.map(x => when(col(x).isNotNull, concat(lit(","), lit(x))).otherwise(",")).reduce(concat(_, _))
val df2 = df.withColumn("nonNulls", nonNulls) 
val df3 = df2.withColumn("nonNullsCols", array_remove(split(col("nonNulls"),","), lit(""))).drop("nonNulls")
thebluephantom
  • 16,458
  • 8
  • 40
  • 83
  • +1 works well. Can you please explain the usage of a) reduce(concat(_, _)) and b) lit("") used during the split. @thebluephantom – ark Jun 01 '20 at 16:47
  • Also as a newbie to functional programming on data frames, I will take any suggestions on best way/links to learn the nuances of dataframe programming. Because I did spend a lot of time on this without getting nowhere. – ark Jun 01 '20 at 16:50
  • The lit("") is for null. reduce, concat just keeps on building up the string which will be later splt into an array. reduce is for numbers, but you can apply to strings as well. It ain't for everybody. I am an IT Solution Architect who happens to be a not so bad programmer and I have worked as Lead Data Engineer by necessity in Big Data. Enjoyed it. But best way is to look here at questions that have been posed. You can look at my questions and answers, those from pault, LeoC, blackbishop, Raphel Roth, et al. And try answering, that is how I keep myself sharp. Enjoy. – thebluephantom Jun 01 '20 at 17:09
  • Thank you. will do. – ark Jun 01 '20 at 17:39