
I have some Pandas code that computes the R² of a linear regression over a window of size x. Here is my code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def lr_r2_Sklearn(data):
    # Regress the window values against their position (0, 1, 2, ...) and return R²
    data = np.array(data)
    X = pd.Series(range(len(data))).values.reshape(-1, 1)
    Y = data.reshape(-1, 1)

    regressor = LinearRegression()
    regressor.fit(X, Y)

    return regressor.score(X, Y)

r2_rolling = df[['value']].rolling(300).agg([lr_r2_Sklearn])

I am using a rolling window of size 300 and computing the R² for each window. I would like to do exactly the same thing, but with PySpark and a Spark dataframe. I know I have to use the Window function, but it is a bit harder to understand than Pandas, so I am lost...

I have this, but I don't know how to make it work.

from pyspark.sql import Window
from pyspark.sql.functions import lit

w = Window().partitionBy(lit(1)).rowsBetween(-299, 0)
data.select(lr_r2('value').over(w).alias('r2')).show()

(lr_r2 returns the R².)

Thanks!

Emerois

1 Answer


You need a pandas UDF with a bounded window condition. This is not possible until Spark 3.0, which is still in development; see the answer here: User defined function to be applied to Window in PySpark?

However, you can explore the ml package of PySpark: http://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html#pyspark.ml.classification.LinearSVC

You can define a model such as LinearSVC and pass various parts of the dataframe to it after assembling the features. I suggest using a pipeline consisting of two stages, an assembler and the model, and then calling it in a loop over the various parts of your dataframe, filtering them by some unique id. Rough sketches of both routes follow.
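For the Spark 3.0 route, here is a minimal sketch. It assumes Spark >= 3.0 (where pandas grouped-aggregate UDFs can be applied over bounded window frames), a value column as in the question, and a hypothetical id column giving the row order, since a bounded rowsBetween frame needs an orderBy:

import numpy as np
import pandas as pd
from pyspark.sql import Window
from pyspark.sql.functions import lit, pandas_udf
from sklearn.linear_model import LinearRegression

@pandas_udf("double")
def lr_r2(values: pd.Series) -> float:
    # R² of a linear regression of the window values against their position
    y = values.to_numpy().reshape(-1, 1)
    X = np.arange(len(values)).reshape(-1, 1)
    return float(LinearRegression().fit(X, y).score(X, y))

# 'id' is an assumed column that defines the ordering of the rows
w = Window.partitionBy(lit(1)).orderBy("id").rowsBetween(-299, 0)
data.select(lr_r2("value").over(w).alias("r2")).show()

And here is a rough sketch of the pipeline-in-a-loop idea for Spark 2.4, using LinearRegression from pyspark.ml.regression rather than LinearSVC, since the goal here is an R² score. It assumes a hypothetical consecutive integer column id that can be used to slice out each 300-row window:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql.functions import col

assembler = VectorAssembler(inputCols=["x"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="value")
pipeline = Pipeline(stages=[assembler, lr])

r2_per_window = []
n = data.count()
for start in range(n - 300 + 1):
    # rows start .. start+299 form one window; x is the position inside the window
    window_df = (data.filter((col("id") >= start) & (col("id") < start + 300))
                     .withColumn("x", (col("id") - start).cast("double")))
    model = pipeline.fit(window_df)
    r2_per_window.append(model.stages[-1].summary.r2)

Note that the loop fits one model per window from the driver, so it will be much slower than a single window expression when there are many windows.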

Raghu