
I have seen similar questions to this, but in Scala or R.

This is about PySpark.

Imagine a table like this:

|date   |account|balance|
|-------|-------|-------|
|2019-01|1      |50     |
|2019-01|2      |23     |
|2019-01|3      |56     |
|2019-02|1      |12     |
|2019-02|2      |31     |
|2019-02|3      |90     |
|2019-03|1      |24     |
|2019-03|2      |67     |
|2019-03|3      |78     |

In the first month, each account has a balance. In each following month the balance changes.

Is there a way to get something like this:

|date   |account|balance|ratio|
|-------|-------|-------|-----|
|2019-01|1      |50     |null |
|2019-01|2      |23     |null |
|2019-01|3      |56     |null |
|2019-02|1      |12     |31   |
|2019-02|2      |31     |27   |
|2019-02|3      |90     |73   |
|2019-03|1      |24     |18   |
|2019-03|2      |67     |49   |
|2019-03|3      |78     |84   |

The first three rows are null because there is no value for the previous month.

How do we get each value?

Take for example 31, the first available number in the ratio column:

31 = (account 1 balance previous month + account 1 balance current month) / 2 = (50 + 12) / 2

Is there a way to write code for this in PySpark?
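
To make it concrete, here is a minimal sketch of the kind of thing I have in mind, using a window function. The sample data is just the table above; the window layout is only my guess at the right approach:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as f

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the table above
    df = spark.createDataFrame(
        [("2019-01", 1, 50), ("2019-01", 2, 23), ("2019-01", 3, 56),
         ("2019-02", 1, 12), ("2019-02", 2, 31), ("2019-02", 3, 90),
         ("2019-03", 1, 24), ("2019-03", 2, 67), ("2019-03", 3, 78)],
        ["date", "account", "balance"],
    )

    # One window per account, ordered by month, so lag(1) is the same
    # account's balance from the previous month (null for the first month)
    w = Window.partitionBy("account").orderBy("date")

    df = df.withColumn(
        "ratio",
        (f.lag("balance", 1).over(w) + f.col("balance")) / 2,
    )

    df.orderBy("date", "account").show()

If that window idea is right, the first month of each account should come out null, because lag has nothing to look back to.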

Thanks a lot. I can give more info if I have not been clear :)

UPDATE!

In my case there are 50 accounts per month. I tried f.lag("balance", count=50)

but this puts values like this: 50 rows are null and 50 rows are lagged, then 50 are null and 50 are lagged again, and so on.

Is there a way to lag the column by 50 rows?

Here is my code:

    from pyspark.sql import Window
    from pyspark.sql import functions as f

    my_window = Window.partitionBy("date").orderBy("date")
    df = df.withColumn("moved_balance", f.lag("balance", 50).over(my_window))

I do not know the reason for partitionBy and orderBy on the window, but it seems they are required, as I saw in other questions.
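
If it helps, here is a minimal sketch of the direction I am wondering about instead of the 50-row lag (assuming the column is literally called "balance", as in the example table): partition by account rather than date, so a lag of 1 row already means "the same account, one month earlier".

    from pyspark.sql import Window
    from pyspark.sql import functions as f

    # partitionBy chooses which rows are grouped together for the window,
    # orderBy chooses what "previous row" means inside that group.
    # Partitioning by account and ordering by date makes lag(1) the same
    # account's balance one month earlier, so no 50-row offset is needed.
    my_window = Window.partitionBy("account").orderBy("date")

    df = df.withColumn("moved_balance", f.lag("balance", 1).over(my_window))
    df = df.withColumn("ratio", (f.col("moved_balance") + f.col("balance")) / 2)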
