I have what I initially thought would be a trivial problem.
I have a dataset relating to a support ticketing system. For each ticket I have the following columns: open_date, last_updated_date, status (only open or closed), product, customer.
What I want to build is a dataset that summarises ticket status on a monthly basis. For each month I want to see the number of tickets opened and closed in that month, which is relatively straightforward. I also want to identify the backlog, by age, at the end of each month, i.e. how many of the tickets still open at the end of a month were < 30 days old, 30-60 days old, or > 60 days old.
So the columns would be:
month, product, customer, Number_opened, Number_Closed, Number<30_days, Number30-60_days, Number>60_days
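To make the shape of the data concrete, here is a minimal sketch of the input and the output I'm after; the sample rows and the snake_case output names are just mine for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy version of the input; the real data has many more rows.
tickets = spark.createDataFrame(
    [
        ("2023-01-05", "2023-02-20", "closed", "widget", "acme"),
        ("2023-01-28", "2023-04-03", "open",   "widget", "acme"),
        ("2023-02-14", "2023-02-25", "closed", "gadget", "globex"),
    ],
    ["open_date", "last_updated_date", "status", "product", "customer"],
)

# Target output: one row per month / product / customer with
# number_opened, number_closed, number_lt_30_days,
# number_30_60_days and number_gt_60_days columns.
```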
The basic algorithm would be:

for each ticket:
    increment the opened and closed counts in the corresponding months
    increment the relevant backlog age columns for each month between the opened month and the closed month
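To make the intent concrete, here is a rough pandas sketch of that per-ticket loop, using the toy `tickets` frame above; the variable and bucket names are my own, and grouping by product/customer is left out for brevity. It is exactly the kind of row-by-row work I suspect won't scale:

```python
import pandas as pd
from collections import defaultdict

pdf = tickets.toPandas()  # the very step I suspect won't scale
pdf["open_date"] = pd.to_datetime(pdf["open_date"])
pdf["last_updated_date"] = pd.to_datetime(pdf["last_updated_date"])

counts = defaultdict(lambda: defaultdict(int))  # month -> column -> count
today = pd.Timestamp.today().normalize()

for row in pdf.itertuples():
    counts[row.open_date.to_period("M")]["number_opened"] += 1

    closed = row.status == "closed"
    end = row.last_updated_date if closed else today
    if closed:
        counts[end.to_period("M")]["number_closed"] += 1

    # Bump the age bucket for every month-end at which the ticket was still open.
    month_end = row.open_date + pd.offsets.MonthEnd(0)
    while month_end < end:
        age = (month_end - row.open_date).days
        if age < 30:
            bucket = "number_lt_30_days"
        elif age <= 60:
            bucket = "number_30_60_days"
        else:
            bucket = "number_gt_60_days"
        counts[month_end.to_period("M")][bucket] += 1
        month_end = month_end + pd.offsets.MonthEnd(1)
```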
I'm working on a Palantir system based on Spark, so I thought that PySpark would be the sword of choice. However, this SO post, how to loop through each row of dataFrame in pyspark, seems to say that this is not something you should do in Spark, since it is a distributed system that works more on a columnar basis.
Looping per row with .collect() would be equivalent to dumping the whole dataframe into the driver's memory, which will eventually cause problems as the data grows. Alternatively, I can export to a pandas dataframe and work that way, but over time that runs into the same problem.
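For reference, this is the pattern the linked answer warns against, i.e. pulling every row back to the driver and looping there:

```python
for row in tickets.collect():  # materialises the whole dataset on the driver
    ...                        # per-ticket increments as in the sketch above
```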
Does anyone have a more PySpark-esque approach in mind that could do what I'm looking for?
Best regards,
Colm
P.S. I think every support team generates this kind of stats, so if someone knows of a magic module out there that already does this, feel free to radically brighten up my day.