I have the function below, where F.col("days") is an Integer column
def return_window(d):
days = lambda i: i * 86400 # converts to seconds
w = Window.partitionBy(
"a"
).orderBy(
F.col("date").cast("timestamp").cast("long")
).rangeBetween(1, days(d))
return w
df = df.withColumn(
"total_qty",
F.sum("qty").over(return_window(F.col("days")))
)
I'm getting the error in return_window
:
).rangeBetween(1, days(d))
in rangeBetween
if end >= Window._FOLLOWING_THRESHOLD:
File "/scratch/asset-install/6add7b36ad350d9f0c07885622f2e3ae/miniconda36/lib/python3.6/site-packages/pyspark/sql/column.py", line 907, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
How can I pass a column as an integer to use here? It works when I pass in a real int like F.sum("qty").over(return_window(5))
The df before doing the window function looks like (if today's 5/20)
a | date | qty | days
---------------------------------
'alpha'| 5/20 | 2 | 2
'alpha'| 5/21 | 5 | 1
'alpha'| 5/22 | 1 | 4
'alpha'| 5/23 | 4 | 7
'alpha'| 5/24 | 7 | 0
'alpha'| 5/25 | 8 | 0
'beta' | 5/20 | 2 | 2
'beta' | 5/23 | 9 | 5
'beta' | 5/24 | 2 | 1
'beta' | 5/25 | 3 | 6
'beta' | 5/27 | 7 | 3
This is what I'd like to add (total_qty is the sum(qty) over the next 1 - "days" days)
a | date | qty | days | total_qty
--------------------------------------------
'alpha'| 5/20 | 2 | 2 | 6
'alpha'| 5/21 | 5 | 1 | 1
'alpha'| 5/22 | 1 | 4 | 19
'alpha'| 5/23 | 4 | 7 | 15
'alpha'| 5/24 | 7 | 0 | 0
'alpha'| 5/25 | 8 | 0 | 0
'beta' | 5/20 | 2 | 2 | 0
'beta' | 5/23 | 9 | 5 | 12
'beta' | 5/24 | 2 | 1 | 3
'beta' | 5/25 | 3 | 6 | 7
'beta' | 5/27 | 7 | 3 | 0