
I want to generate features by performing a group of operations on a set of columns of a dataframe. My dataframe schema looks like:

root
 |-- CreatedOn: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- Industry: string (nullable = true)
 |-- region: string (nullable = true)
 |-- Customer: string (nullable = true)

For example, count the number of times the ID and region were used in the last 3/2/1 months. For this I have to scan the entire dataframe with respect to the current record. Current logic (a rough sketch of the loop is shown after the list):

    1. for i in df.collect() - collect the rows to the driver and loop over them one by one.
    2. Filter the dataframe to the 3 months before this row's CreatedOn.
    3. Generate the features from the filtered data.
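
Roughly, the loop looks like the sketch below (simplified to one feature; CreatedOnDate is a hypothetical date column parsed from the CreatedOn string):

    from datetime import timedelta
    from pyspark.sql import functions as F

    results = []
    for row in df.collect():                                      # row-wise collect on the driver
        window_start = row["CreatedOnDate"] - timedelta(days=90)  # "last 3 months"
        count_region = df.filter(
            (F.col("ID") == row["ID"])
            & (F.col("region") == row["region"])
            & (F.col("CreatedOnDate") >= window_start)
            & (F.col("CreatedOnDate") < row["CreatedOnDate"])
        ).count()                                                  # full scan of df for every single row
        results.append((row["ID"], row["CreatedOn"], count_region))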

The code works, but since it is a row-wise loop it runs for more than 10 hours. Is there any way I can replace the row-wise operation in PySpark, so that it actually leverages the parallelism that PySpark provides? Something like groupBy?

Sample data:

S.No  ID     CreatedOn     Industry        Region
1     ERP    05thMay2020   Communications  USA
2     ERP    28thSept2020  Communications  USA
3     ERP    15thOct2020   Communications  Europe
4     ERP    15thNov2020   Communications  Europe
5     Cloud  1stDec2020    Insurance       Europe
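
For reference, this sample data can be built as follows (a minimal sketch; the dates are rewritten as ISO strings so to_date can parse them directly, S.No is renamed to SNo, and CreatedOnDate is just an illustrative column name):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Same rows as the table above, with CreatedOn kept as a string column.
    data = [
        (1, "ERP",   "2020-05-05", "Communications", "USA"),
        (2, "ERP",   "2020-09-28", "Communications", "USA"),
        (3, "ERP",   "2020-10-15", "Communications", "Europe"),
        (4, "ERP",   "2020-11-15", "Communications", "Europe"),
        (5, "Cloud", "2020-12-01", "Insurance",      "Europe"),
    ]
    df = spark.createDataFrame(data, ["SNo", "ID", "CreatedOn", "Industry", "region"])
    df = df.withColumn("CreatedOnDate", F.to_date("CreatedOn"))   # parsed copy used for the date filters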

Consider record #4. Feature 1 (Count_3monthsRegion): I want to see how many times ERP was used in Europe in the last 3 months (w.r.t. CreatedOn). The answer will be 1. (Record #2 is also ERP within the last 3 months, but in a different region, USA, so it does not count.)

Feature 2 (Count_3monthsIndustry): I want to see how many times ERP was used in Communications in the last 3 months (w.r.t. CreatedOn). The answer will be 2.
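
As a concrete check, the two counts for record #4 can be reproduced against the sketch above (treating "3 months" as roughly 90 days):

    from datetime import date, timedelta

    ref = date(2020, 11, 15)          # CreatedOn of record #4
    start = ref - timedelta(days=90)  # "last 3 months" approximated as 90 days

    in_window = (F.col("CreatedOnDate") >= start) & (F.col("CreatedOnDate") < ref)

    count_3m_region = df.filter(
        (F.col("ID") == "ERP") & (F.col("region") == "Europe") & in_window
    ).count()     # 1 -> only record #3; record #2 is ERP but in USA

    count_3m_industry = df.filter(
        (F.col("ID") == "ERP") & (F.col("Industry") == "Communications") & in_window
    ).count()     # 2 -> records #2 and #3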

Expected output:

S.No  ID     CreatedOn     Industry        Region  Count_3monthsRegion  Count_3monthsIndustry
1     ERP    05thMay2020   Communications  USA     0                    0
2     ERP    28thSept2020  Communications  USA     0                    0
3     ERP    15thOct2020   Communications  Europe  0                    1
4     ERP    15thNov2020   Communications  Europe  1                    2
5     Cloud  1stDec2020    Insurance       Europe  0                    0
