I need to implement the equivalent of pandas .apply(function, axis=1) (a row-wise function) in PySpark. As I am a novice, I am not sure whether it should be implemented with a map function or with UDFs. I have not been able to find any similar implementation anywhere.
Basically, all I want is to pass a row to a function, do some operations that create new columns whose values depend on the current and previous rows, and then return the modified rows to build a new dataframe. One of the functions used with pandas is given below:
import pandas as pd
import numpy as np

previous = 1

def row_operation(row):
    global previous
    # Reset the counter whenever COL_A changes or has no previous value
    if pd.isnull(row["PREV_COL_A"]) or row["COL_A"] != row["PREV_COL_A"]:
        current = 1
    elif row["COL_C"] > cutoff:   # cutoff is defined elsewhere in my script
        current = previous + 1
    elif row["COL_C"] <= cutoff:
        current = previous
    else:
        current = np.nan          # COL_C is NaN, so neither comparison holds
    previous = current
    return current
Here PREV_COL_A is nothing but COL_A lagged by 1 row.
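For context, the lag itself seems straightforward in PySpark with a window function; a minimal sketch, assuming an ordering column such as ROW_ID exists (a placeholder name here):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy("ROW_ID")  # ROW_ID stands in for whatever defines the row order
df = df.withColumn("PREV_COL_A", F.lag("COL_A", 1).over(w))

It is the running "previous" state across rows that I do not know how to express.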
Please note that this function is the simplest one; it does not return rows, but others do. If anyone can guide me on how to implement row operations in PySpark, it would be a great help. TIA
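Edit: the only direction I can think of so far is to force everything into a single sorted partition and carry the state through mapPartitions. A minimal sketch of that idea (ROW_ID and cutoff are placeholders from my setup, and the single partition gives up all parallelism, so it may not be the right answer):

from pyspark.sql import Row

def partition_op(rows):
    # Carry the running state through the (single, sorted) partition
    previous = 1
    prev_col_a = None
    for r in rows:
        if prev_col_a is None or r["COL_A"] != prev_col_a:
            current = 1                  # COL_A changed: reset the counter
        elif r["COL_C"] is None:
            current = float("nan")       # mirrors the NaN branch of the pandas version
        elif r["COL_C"] > cutoff:
            current = previous + 1       # above the cutoff: increment
        else:
            current = previous           # at or below the cutoff: carry over
        previous = current
        prev_col_a = r["COL_A"]
        yield Row(**r.asDict(), CURRENT=current)

result = (df.repartition(1)                  # one partition so the state is global
            .sortWithinPartitions("ROW_ID")  # restore the row order inside it
            .rdd
            .mapPartitions(partition_op)
            .toDF())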