I have seen similar questions to this, but in Scala or R.
This is about PySpark.
Imagine a table like this:
|date   |account|balance|
|-------|-------|-------|
|2019-01|1      |50     |
|2019-01|2      |23     |
|2019-01|3      |56     |
|2019-02|1      |12     |
|2019-02|2      |31     |
|2019-02|3      |90     |
|2019-03|1      |24     |
|2019-03|2      |67     |
|2019-03|3      |78     |
In the first month, each account has a balance. In each of the following months, the balance changes.
Is there a way to get something like this:
|date   |account|balance|ratio|
|-------|-------|-------|-----|
|2019-01|1      |50     |null |
|2019-01|2      |23     |null |
|2019-01|3      |56     |null |
|2019-02|1      |12     |31   |
|2019-02|2      |31     |27   |
|2019-02|3      |90     |73   |
|2019-03|1      |24     |18   |
|2019-03|2      |67     |49   |
|2019-03|3      |78     |84   |
The first three rows are null because there is no value for a previous month.
How is each value computed?
Take 31 for example (the first available number in the ratio column):
ratio = (account 1 balance in the previous month + account 1 balance in the current month) / 2
31 = (50 + 12) / 2
Is there a way to write code for this in PySpark?
Thanks a lot. More info can be given if I have not been clear :)
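To make what I am after a bit more concrete, I imagine the ratio itself would be computed with something like this, assuming I could somehow get the previous month's balance into a column first (prev_balance is just an illustrative name):

import pyspark.sql.functions as f

# ratio = (previous month's balance + current month's balance) / 2
df = df.withColumn("ratio", (f.col("prev_balance") + f.col("balance")) / 2)

The part I am missing is how to produce that previous-month value per account in the first place.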
UPDATE!
For my case there are 50 accounts per month. I tried f.lag("balance", count=50),
but this produces values like this: 50 are null, 50 are lagged, 50 are null, 50 are lagged, etc.
Is there a way to lag the column by 50 rows?
Here is my code:
my_window = Window.partitionBy("date").orderBy("date")
df = df.withColumn("moved_balance", f.lag("balance", 50).over(my_window))
I do not know what the reason is for the partitionBy and orderBy on the window, but it seems lag needs them, as I saw in other questions.
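If I understand the window correctly, partitionBy("date") puts each month in its own partition, so lagging within it can never reach the previous month's rows. What I was wondering is whether a window per account with a lag of 1 would be the proper way instead, something like this (just a guess on my part):

from pyspark.sql import Window
import pyspark.sql.functions as f

# one partition per account, rows ordered by month,
# so lag(1) should pick the same account's previous-month balance
my_window = Window.partitionBy("account").orderBy("date")
df = df.withColumn("moved_balance", f.lag("balance", 1).over(my_window))

and then compute the ratio from moved_balance and balance as above. Would that be correct, or is there a better way?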