I have what I initially thought would be a trivial problem.
I have a dataset relating to a support ticketing system. For each ticket I have the following columns: open_date, last_updated_date, status (only open or closed), product, customer.
What I want to build is a dataset that summarises ticket status on a monthly basis. For each month I want to see the number of tickets opened and closed in that month, which is relatively straightforward. I also want to identify the backlog, by age, at the end of each month, i.e. how many of the tickets still open at the end of a month were < 30 days old, 30-60 days old, or > 60 days old.
So the columns would be:
month, product, customer, Number_opened, Number_Closed, Number<30_days, Number30-60_days, Number>60_days
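To make the shape of the data concrete, here is a minimal sketch of the input and the output I'm after; the sample rows and the snake_case output names are just mine for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy version of the input; the real data has many more rows.
tickets = spark.createDataFrame(
    [
        ("2023-01-05", "2023-02-20", "closed", "widget", "acme"),
        ("2023-01-28", "2023-04-03", "open",   "widget", "acme"),
        ("2023-02-14", "2023-02-25", "closed", "gadget", "globex"),
    ],
    ["open_date", "last_updated_date", "status", "product", "customer"],
)

# Target output: one row per month / product / customer with
# number_opened, number_closed, number_lt_30_days,
# number_30_60_days and number_gt_60_days columns.
```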
The basic algorithm would be:

for each ticket:
    increment the opened and closed counts in the corresponding months
    increment the relevant backlog age columns for each month between the opened month and the closed month
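To make the intent concrete, here is a rough pandas sketch of that per-ticket loop, using the toy `tickets` frame above; the variable and bucket names are my own, and grouping by product/customer is left out for brevity. It is exactly the kind of row-by-row work I suspect won't scale:

```python
import pandas as pd
from collections import defaultdict

pdf = tickets.toPandas()  # the very step I suspect won't scale
pdf["open_date"] = pd.to_datetime(pdf["open_date"])
pdf["last_updated_date"] = pd.to_datetime(pdf["last_updated_date"])

counts = defaultdict(lambda: defaultdict(int))  # month -> column -> count
today = pd.Timestamp.today().normalize()

for row in pdf.itertuples():
    counts[row.open_date.to_period("M")]["number_opened"] += 1

    closed = row.status == "closed"
    end = row.last_updated_date if closed else today
    if closed:
        counts[end.to_period("M")]["number_closed"] += 1

    # Bump the age bucket for every month-end at which the ticket was still open.
    month_end = row.open_date + pd.offsets.MonthEnd(0)
    while month_end < end:
        age = (month_end - row.open_date).days
        if age < 30:
            bucket = "number_lt_30_days"
        elif age <= 60:
            bucket = "number_30_60_days"
        else:
            bucket = "number_gt_60_days"
        counts[month_end.to_period("M")][bucket] += 1
        month_end = month_end + pd.offsets.MonthEnd(1)
```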
I'm working on a Palantir system based on Spark, so I thought that PySpark would be the sword of choice. However, this SO post, how to loop through each row of dataFrame in pyspark, seems to say that this is not something you should do in Spark, since it is a distributed system that works more on a columnar basis.
Looping per row with .collect() would be equivalent to dumping the whole dataframe into the driver's memory, which will eventually cause problems as the data grows. Alternatively, I can export to a pandas dataframe and work that way, but over time that runs into the same problem.
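For reference, this is the pattern the linked answer warns against, i.e. pulling every row back to the driver and looping there:

```python
for row in tickets.collect():  # materialises the whole dataset on the driver
    ...                        # per-ticket increments as in the sketch above
```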
Does anyone have a more PySpark-esque approach in mind that could do what I'm looking for?
Best regards,
Colm
P.S. I think every support team generates this kind of stats, so if someone knows of a magic module out there that already does this, feel free to radically brighten up my day.