0

I have a Dataframe which has following data(simulation).

Price Date InvoiceNumber Product_Type
12.65 12/30/2021 INV_19984 AXN UN1234
18.78 1/23/2022 INV_200174 AXN UN1234
11.78 1/25/2022 INV_200173 AXN UN1234
11.1 3/2/2022 INV_9912 AXN UN1234

I am trying to find minimum price of the product in the last 30days, 60days and 90days without filtering the data. This data is huge and hence avoiding join operation so using a window function. But It returns null.

wi = Window.partitionBy("Product_Type").orderBy("Date").rowsBetween(30 ,Window.currentRow)
f1 = df.withColumn("minPrice30",F.min("price").over(wi))

Expected output is :

minPrice60 minPrice30 Price Date InvoiceNumber Product_Type
12.65 12.65 12.65 12/30/2021 INV_19984 AXN UN1234
12.65 12.65 18.78 1/23/2022 INV_200174 AXN UN1234
11.78 11.78 11.78 1/25/2022 INV_200173 AXN UN1234
11.1 11.1 11.1 3/2/2022 INV_9912 AXN UN1234

Is there any way we can achieve this within the date range? Kindly Suggest

Ramaraju.d
  • 1,301
  • 6
  • 26
  • 46
  • `rowsBetween` is based on the # of rows in dataframe and `rowsBetween(30, currentRow)` means you are aggregating between `((current row+30)th row) and (current row)` which doesn't exist. `rowsBetween(-30, currentRow)` or `rowsBetween(currentRow, 30)` would make a valid range but this is again based on the # of rows. What you are looking for is window based on time interval. For that, you should use `rangeBetween`. Check this for how to use it. https://stackoverflow.com/questions/33207164/spark-window-functions-rangebetween-dates – Emma Jan 11 '23 at 15:26
  • @Emma Indeed I tried rangebetween option like .rangeBetween(60,currentrow) since I am comparing against Date and 60 being int it was throwing error. I just tried in spark sql way and looks like it is working. I missed to use Interval keyword with range in my earlier try. Thanks for showing me the right way to use it :-) – Ramaraju.d Jan 11 '23 at 15:40
  • 2
    `currentRow = 0th`, so `rangeBetween(60th, 0th)` is not a valid range. you need `rangeBetween(-60th, 0th)`. And also you need to `unix_date` or `datediff` in the `orderBy` to convert the datetime type to int. Check your current data type for "Date" and use the corresponding one in here https://stackoverflow.com/a/72854840/2956135. – Emma Jan 11 '23 at 15:43

0 Answers0