I would like to sum
(or perform other aggregate functions too) on the array column using SparkSQL.
I have a table as
+-------+-------+---------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+---------------------------------+
| 10|Finance| [100, 200, 300, 400, 500]|
| 20| IT| [10, 20, 50, 100]|
+-------+-------+---------------------------------+
I would like to sum the values of this emp_details
column .
Expected query:
sqlContext.sql("select sum(emp_details) from mytable").show
Expected result
1500
180
Also I should be able to sum on the range elements too like :
sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show
result
600
80
when doing sum on the Array type as expected it shows that sum expects argument to be numeric type not array type.
I think we need to create UDF for this . but how ?
Will I be facing any performance hits with UDFs ? and is there any other solution apart from the UDF one ?