I have a PySpark DataFrame (df) backed by the sample table below (table1):

id, col1, col2, col3
1, abc, null, def
2, null, def, abc
3, def, abc, null
I am trying to build a new column (final) by appending all the columns while ignoring null values. I tried f.array(col1, col2, col3) in PySpark; the values are appended, but null values are not ignored. I also tried a UDF that appends only the non-null columns, but it did not work either.
import pyspark.sql.functions as f

df = spark.table('table1')
df = df.withColumn('final', f.array('col1', 'col2', 'col3'))
Actual result:

id, col1, col2, col3, final
1, abc, null, def, [abc,, def]
2, null, def, abc, [, def, abc]
3, def, abc, null, [def, abc,]
Expected result:

id, col1, col2, col3, final
1, abc, null, def, [abc, def]
2, null, def, abc, [def, abc]
3, def, abc, null, [def, abc]
My col1, col2, and col3 columns share the schema below (col1, for example, is named applications):
applications: struct (nullable = false)
    applicationid: string (nullable = true)
    createdat: string (nullable = true)
    updatedat: string (nullable = true)
    source_name: string (nullable = true)
    status: string (nullable = true)
    creditbureautypeid: string (nullable = true)
    score: integer (nullable = true)
    applicationcreditreportid: string (nullable = true)
    firstname: string (nullable = false)
    lastname: string (nullable = false)
    dateofbirth: string (nullable = false)
    accounts: array (nullable = true)
        element: struct (containsNull = true)
            applicationcreditreportaccountid: string (nullable = true)
            currentbalance: integer (nullable = true)
            institutionid: string (nullable = true)
            accounttypeid: string (nullable = true)
            dayspastdue: integer (nullable = true)
            institution_name: string (nullable = true)
            account_type_name: string (nullable = true)
Please let me know if the question is unclear or if any more info is required. Any help would be appreciated. :)