Issue when concatenating ArrayType columns of a Spark DataFrame

Question

When I try to concatenate 3 ArrayType columns of a Spark DataFrame, I get erroneous output in some rows.

Some of the columns have no values, so when they are combined the output comes out as, for example, [walmart, []]. I don't want the output to show those empty values. For example, the DataFrame has a column named concat_values with these values:

[walmart, supercenter, walmart supercenter, [walmartsupercenter]]  
[walmart, []]  
[mobil, []] 
[[]]      
[dollar general]  
[marriott vacations, vacations worldwide, marriott vacations worldwide]

The output should be

[walmart, supercenter, walmart supercenter, [walmartsupercenter]]  
[walmart]  
[mobil] 
[]      
[dollar general]  
[marriott vacations, vacations worldwide, marriott vacations worldwide]

The UDF that I have implemented in the code is:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql import functions as F

concat_string_arrays = F.udf(lambda w, x, y, z: w + x + y + z, ArrayType(StringType()))

Please help me with this. Thanks

Could you post a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples/48428198)? It is not clear how you get `[walmartsupercenter]` or what the types are. Is it a nested array or a formatted string, and can it occur at any position? — Alper t. Turker, May 02 '18 at 10:15
Hard to tell without a [mcve] but what if you change your `udf` to something like `F.udf(lambda w,x,y,z : [a for a in [w,x,y,z] if a], ArrayType(StringType()))`? But this won't work for nested arrays. — pault, May 02 '18 at 14:42
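
Without a reproducible example this can only be a rough sketch of the filtering idea from the comment above, written as plain Python so it can be tried without a cluster. It assumes each column value arrives as a Python list (or `None`), and that the unwanted `[]` entries are empty lists, empty strings, or `None` inside those lists. The name `concat_non_empty` is made up for illustration and does not come from the question.

```python
def concat_non_empty(*arrays):
    """Concatenate list-like column values, skipping empty entries.

    Assumption: each argument is a Python list or None, and "empty"
    means None, an empty string, or an empty list.
    """
    merged = []
    for arr in arrays:
        if not arr:
            # The whole column value is None or an empty list.
            continue
        for item in arr:
            if item is None or item == "" or item == []:
                # Drop empty placeholders such as [].
                continue
            merged.append(item)
    return merged


# Sample rows loosely modelled on the values shown in the question.
sample_rows = [
    (["walmart"], [[]], None),
    ([], [[]], []),
    (["dollar general"], [], []),
]

for row in sample_rows:
    print(concat_non_empty(*row))
```

If the assumptions hold for your data, the function could probably be registered the same way as the original lambda, e.g. `F.udf(concat_non_empty, ArrayType(StringType()))`. A non-empty nested list such as `[walmartsupercenter]` is kept as-is here, but it does not match a declared return type of `ArrayType(StringType())`, so how Spark handles that value would need checking against the real column types.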
