
I am using SQL with pyspark and hive, and I'm new to all of it. I have a hive table with a column of type string, like this:

id | values
1  | '2.2;4;4'
2  | '5;1.2'
3  | '8;0;4'

I want to create a query to obtain this:

id | values    | sum
1  | '2.2;4;4' | 10.2
2  | '5;1.2'   | 6.2
3  | '8;0;4'   | 12

By using split(values, ';') I can get arrays like ['2.2','4','4'], but I still need to convert them into decimal numbers and sum them. Is there a not too complicated way to do this?

Thank you so so much in advance! And happy coding to you all :)

GMB
Beatriz S

3 Answers


From Spark 2.4+

We don't have to use explode on the array; we can work on it directly using higher-order functions.

Example:

from pyspark.sql.functions import *

df = spark.createDataFrame([("1", "2;4;4"), ("2", "5;1"), ("3", "8;0;4")], ["id", "values"])

# split the string and cast the result to an array<int> column
df1 = df.withColumn("arr", split(col("values"), ";").cast("array<int>"))

df1.createOrReplaceTempView("tmp")

spark.sql("select *,aggregate(arr,0,(x,y) -> x + y) as sum from tmp").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#|  1| 2;4;4| 10|
#|  2|   5;1|  6|
#|  3| 8;0;4| 12|
#+---+------+---+

# the same aggregation in the DataFrame API

df1.selectExpr("*", "aggregate(arr,0,(x,y) -> x + y) as sum").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#|  1| 2;4;4| 10|
#|  2|   5;1|  6|
#|  3| 8;0;4| 12|
#+---+------+---+
notNull

PySpark solution

from pyspark.sql.functions import udf, col, split
from pyspark.sql.types import FloatType

# UDF to sum the split values, returning None when non-numeric values exist in the string
# Change the implementation of the function as needed
def values_sum(split_list):
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total

values_summed = udf(values_sum, FloatType())
res = df.withColumn('summed', values_summed(split(col('values'), ';')))
res.show()

This could have been a one-liner if the array values were known to be of a given data type. However, it is better to go with a safer implementation that covers all cases.
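The helper itself is plain Python, so it can be sanity-checked locally before it is registered as a UDF (redefined here so the snippet stands alone):

```python
# Same summing helper as in the answer above, standalone for a quick check.
def values_sum(split_list):
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total

print(values_sum("2.2;4;4".split(";")))  # approximately 10.2
print(values_sum("5;x".split(";")))      # None: non-numeric entry
```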

Hive solution

Use explode with split and group by to sum the values.

select id,sum(cast(split_value as float)) as summed
from tbl
lateral view explode(split(values,';')) t as split_value
group by id
Vamsi Prabhala

Write a stored procedure that does the job:

CREATE FUNCTION SPLIT_AND_SUM ( s VARCHAR(1024) ) RETURNS INT
BEGIN
   ...
END
draz