I'll do my best to describe my situation, and I'm hoping someone here can tell me whether the course I'm taking makes sense or whether I need to reevaluate my approach/options.
Background:
I use PySpark since I am most familiar with Python, as opposed to Scala, Java, or R. I have a Spark DataFrame constructed from a Hive table using pyspark.sql to query the table. The DataFrame contains many different 'files', each consisting of time-series data. I need to perform a rolling regression on a subset of the columns, across the full time range of each 'file'. After a good bit of research, my plan was to create a window object, write a UDF specifying how the linear regression should be performed (using Spark ML's linear regression inside the function), and return the result to the DataFrame via a .withColumn() operation. This made sense to me and the approach felt correct. What I discovered is that PySpark does not currently support creating a UDAF (user-defined aggregate function), which is what a custom aggregation over a window actually requires (see the linked JIRA). So here is what I'm currently considering doing.
It is shown here and here that it is possible to write a UDAF in Scala and then reference that function from PySpark. Furthermore, it is shown here that a UDAF written in Scala can take multiple input columns, a necessary feature since I will be doing multiple linear regression and the function needs to take in three columns. What I am unsure of is whether my UDAF can use org.apache.spark.ml.regression, which I had planned to use for the regression itself. If that can't be done, I could fall back to computing the regression manually with matrix operations, i.e. ordinary least squares via the normal equations (I believe Scala's linear algebra libraries allow that). I have virtually no experience with Scala, but I am certainly motivated to learn enough to write this one function.
I'm wondering if anyone has insight or suggestions about the task ahead. After the research I've done, I feel this is both possible and the appropriate course of action. However, I'm wary of burning a ton of time trying to make this work only to find it is fundamentally impossible, or far more difficult than I imagined.
Thanks for your insight.