I want to generate features by performing a group of operations over a set of columns of a dataframe. My dataframe's schema looks like this:
root
|-- CreatedOn: string (nullable = true)
|-- ID: string (nullable = true)
|-- Industry: string (nullable = true)
|-- region: string (nullable = true)
|-- Customer: string (nullable = true)
For example: count the number of times the ID and region were used in the last 3/2/1 months. For this I have to scan the entire dataframe relative to the current record. Current logic:
1. for i in df.collect() - collect the rows and loop over them one by one.
2. Filter the data to the 3 months before this row's CreatedOn.
3. Generate the features from that filtered subset (see the sketch after this list).
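Simplified, the loop is something like this (illustrative only: the real code builds more features, and I'm assuming here that CreatedOn has already been parsed into a date so the comparison works):

from datetime import timedelta
from pyspark.sql import functions as F

results = []
for row in df.collect():                               # 1. row-wise collect on the driver
    start = row["CreatedOn"] - timedelta(days=90)      # 2. ~3 months before this row
    past = df.filter(
        (F.col("CreatedOn") >= F.lit(start)) &
        (F.col("CreatedOn") < F.lit(row["CreatedOn"]))
    )
    # 3. one feature: same ID and Region in that window
    count_region = past.filter(
        (F.col("ID") == row["ID"]) & (F.col("Region") == row["Region"])
    ).count()                                          # every .count() triggers a full Spark job
    results.append((row["ID"], row["CreatedOn"], count_region))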
The code works correctly, but because it is a row-wise loop it runs for more than 10 hours. Is there any way I can replace the row-wise operation in PySpark, since it is not leveraging the parallelism that Spark provides? Something like groupBy?
Sample data:
S.No  ID     CreatedOn     Industry        Region
1     ERP    05thMay2020   Communications  USA
2     ERP    28thSept2020  Communications  USA
3     ERP    15thOct2020   Communications  Europe
4     ERP    15thNov2020   Communications  Europe
5     Cloud  1stDec2020    Insurance       Europe
Consider record #4. Feature 1 (Count_3monthsRegion): I want to see how many times ERP was used in Europe in the last 3 months (relative to its CreatedOn). The answer is 1: only record #3 counts (record #2 is also ERP, but in a different region, so it is excluded).
Feature 2 (Count_3monthsIndustry): I want to see how many times ERP was used in Communications in the last 3 months (relative to its CreatedOn). The answer is 2 (records #2 and #3).
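In plain filter terms, Feature 1 for record #4 is just a filtered count like the one below (assuming CreatedOn has been parsed into a real date column; the literal dates are the 3-month window before record #4):

df.filter(
    (F.col("ID") == "ERP") &
    (F.col("Region") == "Europe") &
    (F.col("CreatedOn") >= "2020-08-15") &
    (F.col("CreatedOn") < "2020-11-15")
).count()   # -> 1 (only record #3 falls in this window)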
Expected output:
S.No  ID     CreatedOn     Industry        Region  Count_3monthsRegion  Count_3monthsIndustry
1     ERP    05thMay2020   Communications  USA     0                    0
2     ERP    28thSept2020  Communications  USA     0                    0
3     ERP    15thOct2020   Communications  Europe  0                    1
4     ERP    15thNov2020   Communications  Europe  1                    2
5     Cloud  1stDec2020    Insurance       Europe  0                    0
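For what it's worth, the kind of approach I am hoping for is a window function rather than a plain groupBy, something along these lines (a rough sketch, not tested: it assumes CreatedOn has already been cleaned and parsed into a timestamp column, which I am calling created_ts, and it approximates 3 months as 90 days):

from pyspark.sql import functions as F, Window

# convert the (assumed) parsed timestamp to seconds so rangeBetween can use it
df = df.withColumn("created_sec", F.col("created_ts").cast("long"))

three_months = 90 * 86400   # "3 months" approximated as 90 days, in seconds

# count strictly earlier rows within the window, per (ID, Region) and per (ID, Industry)
w_region = (Window.partitionBy("ID", "Region")
                  .orderBy("created_sec")
                  .rangeBetween(-three_months, -1))
w_industry = (Window.partitionBy("ID", "Industry")
                    .orderBy("created_sec")
                    .rangeBetween(-three_months, -1))

result = (df
    .withColumn("Count_3monthsRegion", F.count(F.lit(1)).over(w_region))
    .withColumn("Count_3monthsIndustry", F.count(F.lit(1)).over(w_industry)))

Would something like this give the expected output above and scale better than the collect() loop? The 2-month and 1-month variants would presumably just be additional windows with smaller ranges.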