I hope you can help me with this. I have a DF as follows:
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  (1, "a", "2014-12-01", "2015-01-01", 100),
  (2, "a", "2014-12-01", "2015-01-02", 150),
  (3, "a", "2014-12-01", "2015-01-03", 120),
  (4, "b", "2015-12-15", "2015-01-01", 100)
)).toDF("id", "prodId", "dateIns", "dateTrans", "value")
  .withColumn("dateIns", to_date($"dateIns"))
  .withColumn("dateTrans", to_date($"dateTrans"))
I would like to group by prodId and aggregate 'value', summing it over date ranges defined by the difference between the 'dateTrans' and 'dateIns' columns. In particular, I want a conditional sum that adds up all values whose difference between those two columns is within a predefined maximum, i.e. all values that fall within 10, 20, or 30 days of dateIns ('dateTrans' - 'dateIns' <= 10, 20, 30).
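Something along these lines is the shape I have in mind (just an untested sketch combining sum with a conditional when; the output column names are made up):

import org.apache.spark.sql.functions.{datediff, sum, when}

// Sum 'value' only for rows whose dateTrans falls within N days of dateIns;
// rows outside the window produce null, which sum() ignores.
val result = df.groupBy("prodId").agg(
  sum(when(datediff($"dateTrans", $"dateIns") <= 10, $"value")).alias("sum_10d"),
  sum(when(datediff($"dateTrans", $"dateIns") <= 20, $"value")).alias("sum_20d"),
  sum(when(datediff($"dateTrans", $"dateIns") <= 30, $"value")).alias("sum_30d")
)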
Is there any predefined aggregate function in Spark that allows doing conditional sums like this? Would you recommend developing an aggregate UDF (if so, any suggestions)? I'm using PySpark, but I'm very happy to get Scala solutions as well. Thanks a lot!