I have a DataFrame of orders (contactid, orderdate, orderamount) and I want a new column that contains, for each order, the sum of all order amounts for that contact over the 12 months prior to the order. I am thinking the best way is to use window functions and the new INTERVAL support in Spark >= 1.5.

But I'm having difficulty making this work or finding documentation. My best guess is:

    val dfOrdersPlus = dfOrders
      .withColumn("ORDERAMOUNT12MONTH",
        expr("sum(ORDERAMOUNT) OVER (PARTITION BY CONTACTID ORDER BY ORDERDATE RANGE BETWEEN INTERVAL 12 MONTHS PRECEDING AND INTERVAL 1 DAY PRECEDING)"))

But I get a RuntimeException: 'end of input expected'. Any idea what I am doing wrong with this 'expr', and where I could find documentation on the new INTERVAL literals?

pilot7

1 Answer

As of now:

  • Window functions are not supported inside expr. To use window functions you'll have to use either the DataFrame DSL or raw SQL on a registered table (Spark 1.5 and 1.6 only).

  • Window functions support range intervals only for numeric types. You cannot use DateType / TimestampType columns with date INTERVAL expressions (Spark 1.5, 1.6, 2.0.0-preview).
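The raw-SQL route from the first point above can be sketched as follows. This is a hypothetical example, reusing dfOrders and the column names from the question and assuming a sqlContext is in scope (in Spark 1.5, window functions additionally required a HiveContext):

    // Sketch, assuming Spark 1.5/1.6: window functions are rejected inside
    // expr(), but the same OVER clause works in SQL on a registered table.
    dfOrders.registerTempTable("orders")

    val dfWithTotals = sqlContext.sql("""
      SELECT o.*,
             sum(ORDERAMOUNT) OVER (
               PARTITION BY CONTACTID
               ORDER BY ORDERDATE
             ) AS runningTotal
      FROM orders o
    """)

Note this gives a running total over all preceding rows; the 12-month frame still runs into the second limitation above, which is why the timestamp conversion below is needed.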

If you want to use window functions with date or time columns you can convert these to Unix timestamps first. You'll find a full example in Spark Window Functions - rangeBetween dates.
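A sketch of that conversion using the DataFrame DSL, with the column names taken from the question. Since rangeBetween operates on plain numbers rather than calendar intervals, "12 months" is approximated here as 365 days expressed in seconds:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val secondsPerDay = 86400L

    // Order by ORDERDATE converted to seconds since the epoch, so the frame
    // can be stated numerically: from ~12 months before to 1 day before.
    val w = Window
      .partitionBy("CONTACTID")
      .orderBy(col("ORDERDATE").cast("timestamp").cast("long"))
      .rangeBetween(-365L * secondsPerDay, -1L * secondsPerDay)

    val dfOrdersPlus = dfOrders
      .withColumn("ORDERAMOUNT12MONTH", sum(col("ORDERAMOUNT")).over(w))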

zero323