I want to generate features by performing a group of operations over a set of columns of a dataframe. My dataframe's schema looks like this:
root
|-- CreatedOn: string (nullable = true)
|-- ID: string (nullable = true)
|-- Industry: string (nullable = true)
|-- region: string (nullable = true)
|-- Customer: string (nullable = true)
For example: count the number of times the ID and region were used in the last 3/2/1 months. For this I have to scan the entire dataframe relative to the current record. Current logic:
1. for i in df.collect() - collect the rows and loop over them one by one.
2. Filter the data to the 3 months before this row's CreatedOn.
3. Generate the features from that filtered subset (see the sketch after this list).
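Simplified, the loop is something like this (illustrative only: the real code builds more features, and I'm assuming here that CreatedOn has already been parsed into a date so the comparison works):

from datetime import timedelta
from pyspark.sql import functions as F

results = []
for row in df.collect():                               # 1. row-wise collect on the driver
    start = row["CreatedOn"] - timedelta(days=90)      # 2. ~3 months before this row
    past = df.filter(
        (F.col("CreatedOn") >= F.lit(start)) &
        (F.col("CreatedOn") < F.lit(row["CreatedOn"]))
    )
    # 3. one feature: same ID and Region in that window
    count_region = past.filter(
        (F.col("ID") == row["ID"]) & (F.col("Region") == row["Region"])
    ).count()                                          # every .count() triggers a full Spark job
    results.append((row["ID"], row["CreatedOn"], count_region))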
The code works correctly, but because it is a row-wise loop it runs for more than 10 hours. Is there any way I can replace the row-wise operation in PySpark, since it is not leveraging the parallelism that Spark provides? Something like groupBy?
Sample data:
S.No  ID     CreatedOn     Industry        Region
1     ERP    05thMay2020   Communications  USA
2     ERP    28thSept2020  Communications  USA
3     ERP    15thOct2020   Communications  Europe
4     ERP    15thNov2020   Communications  Europe
5     Cloud  1stDec2020    Insurance       Europe
Consider record #4. Feature 1 (Count_3monthsRegion): I want to see how many times ERP was used in Europe in the last 3 months (relative to its CreatedOn). The answer is 1: only record #3 counts (record #2 is also ERP, but in a different region, so it is excluded).
Feature 2 (Count_3monthsIndustry): I want to see how many times ERP was used in Communications in the last 3 months (relative to its CreatedOn). The answer is 2 (records #2 and #3).
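In plain filter terms, Feature 1 for record #4 is just a filtered count like the one below (assuming CreatedOn has been parsed into a real date column; the literal dates are the 3-month window before record #4):

df.filter(
    (F.col("ID") == "ERP") &
    (F.col("Region") == "Europe") &
    (F.col("CreatedOn") >= "2020-08-15") &
    (F.col("CreatedOn") < "2020-11-15")
).count()   # -> 1 (only record #3 falls in this window)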
Expected output:
S.No  ID     CreatedOn     Industry        Region  Count_3monthsRegion  Count_3monthsIndustry
1     ERP    05thMay2020   Communications  USA     0                    0
2     ERP    28thSept2020  Communications  USA     0                    0
3     ERP    15thOct2020   Communications  Europe  0                    1
4     ERP    15thNov2020   Communications  Europe  1                    2
5     Cloud  1stDec2020    Insurance       Europe  0                    0
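For what it's worth, the kind of approach I am hoping for is a window function rather than a plain groupBy, something along these lines (a rough sketch, not tested: it assumes CreatedOn has already been cleaned and parsed into a timestamp column, which I am calling created_ts, and it approximates 3 months as 90 days):

from pyspark.sql import functions as F, Window

# convert the (assumed) parsed timestamp to seconds so rangeBetween can use it
df = df.withColumn("created_sec", F.col("created_ts").cast("long"))

three_months = 90 * 86400   # "3 months" approximated as 90 days, in seconds

# count strictly earlier rows within the window, per (ID, Region) and per (ID, Industry)
w_region = (Window.partitionBy("ID", "Region")
                  .orderBy("created_sec")
                  .rangeBetween(-three_months, -1))
w_industry = (Window.partitionBy("ID", "Industry")
                    .orderBy("created_sec")
                    .rangeBetween(-three_months, -1))

result = (df
    .withColumn("Count_3monthsRegion", F.count(F.lit(1)).over(w_region))
    .withColumn("Count_3monthsIndustry", F.count(F.lit(1)).over(w_industry)))

Would something like this give the expected output above and scale better than the collect() loop? The 2-month and 1-month variants would presumably just be additional windows with smaller ranges.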