
I'll try my best to describe my situation and then I'm hoping another user on this site can tell me if the course I'm taking makes sense or if I need to reevaluate my approach/options.

Background:

I use PySpark since I am most familiar with Python versus Scala, Java, or R. I have a Spark DataFrame that was constructed from a Hive table by querying it with pyspark.sql. This DataFrame contains many different 'files'. Each file consists of time-series data. I need to perform a rolling regression on a subset of the data, across the entire time range of each 'file'. After doing a good bit of research I was planning on creating a window object, writing a UDF that specified how the linear regression should occur (using Spark ML's linear regression inside the function), and then returning the result to the DataFrame. This would all happen inside a .withColumn() operation. This made sense to me and I feel like this approach is correct. What I discovered is that PySpark currently does not support creating UDAFs (see the linked JIRA). So here is what I'm currently considering doing.

It is shown here and here that it is possible to create a UDAF in Scala and then reference that function from PySpark. Furthermore, it is shown here that a UDAF (written in Scala) is able to take multiple input columns (a necessary feature, since I will be doing multiple linear regression with 3 parameters). What I am unsure of is whether my UDAF can use org.apache.spark.ml.regression, which I plan on using for the regression. If that can't be done, I could carry out the computation manually using matrices (I believe Scala allows that). I have virtually no experience with Scala but am certainly motivated to learn enough to write this one function.
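The "manually using matrices" fallback mentioned above is less daunting than it sounds: ordinary least squares with an intercept is a single linear solve, in Scala (via Breeze) or in Python alike. A minimal NumPy sketch of what one window's fit would compute — all names here are illustrative, not from the original post:

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares with an intercept column.

    X: (n_samples, n_features) predictor matrix (e.g. the 3 regression inputs).
    y: (n_samples,) response vector.
    Returns (intercept, coefficients).
    """
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # solve min ||A @ beta - y||
    return beta[0], beta[1:]

# Tiny check on exact data: y = 2 + 3*x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 + 3.0 * X[:, 0]
intercept, coefs = ols_fit(X, y)
```

Because this is plain local linear algebra rather than a distributed estimator, it is the kind of computation a custom aggregate or window function is allowed to run per group.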

I'm wondering if anyone has insight or suggestions about this task ahead. I feel like after the research I've done, this is both possible and the appropriate course of action to take. However, I'm scared of burning a ton of time trying to make this work when it is fundamentally impossible or way more difficult than I could imagine.

Thanks for your insight.

Kevin Frankola
  • Hard to tell without more details about the calculation, and how you plan to use org.apache.spark.ml.regression. Could you provide more detail? – Roberto Congiu Oct 17 '17 at 20:31
  • Sure, here is the Python code I wrote; tbh I'm not sure if it works since I can't test it: `def lrreg_int(sin_time, cos_time, prediction): lr = ml.regression.LinearRegression(predictionCol=prediction); model = lr.fit(sin_time, cos_time); intercept = model.intercept; return intercept` – Kevin Frankola Oct 17 '17 at 20:41
  • Still not clear what you're trying to achieve. You say, rolling linear regression based on time series, it sounds like a streaming-oriented approach to regression, have you had a look at streaming-oriented mllib ? Based on what you said, this may be what you want http://spark.apache.org/docs/2.2.0/mllib-linear-methods.html#streaming-linear-regression – Roberto Congiu Oct 17 '17 at 22:06
  • So I have a table that has, let's say, a million rows. The table is made up of files, say 500 files, and each file has 2000 rows. A unique identifier is associated with each file, with a one-to-many mapping: for the 2000 rows associated with file 'a', there is a column 'filename' that contains just the entry 'a'. So I make a window, partition by filename, then order by time, then attempt to hand a column of data to my UDF while calling 'over' with the window object, and this is where I discovered the problem. I hope that helps. The streaming is an interesting idea; I need to read more. – Kevin Frankola Oct 18 '17 at 01:50
  • I forgot to add (and perhaps most important): my window also looks 15 rows above and 15 rows below the row that the operation is applied to; i.e., I need to do one regression per point in the DataFrame/table. This is why I need the windowing capability, and this is the 'rolling' part. I'm doing 1,000,000 regressions for my whole table, looking at what is around each point. – Kevin Frankola Oct 18 '17 at 01:55
  • It is possible to write your own window functions; I actually just wrote one to 'sessionize' user activity. However, there's not much documentation about it — I basically reverse-engineered Spark's window functions and learned how they work. It's not easy at all, but it's possible. – Roberto Congiu Oct 18 '17 at 20:06
  • Could you share the code so I could learn how to do it? That's awesome, though. – Kevin Frankola Oct 19 '17 at 15:30
  • I can't share it without permission since it's owned by the company I work for, but I can see if I can write something similar and share it. Will take a few days though. – Roberto Congiu Oct 19 '17 at 18:02
  • I understand. If you can write that, though, it would help me immensely and I can't possibly thank you enough! Seriously, thank you!! :) – Kevin Frankola Oct 19 '17 at 18:49
  • Alright, I wrote a blog post on how to write a custom UDWF here http://blog.nuvola-tech.com/2017/10/spark-custom-window-function-for-sessionization/ but, as I said, it's not too simple. Also, sessionization in the example is probably easier to implement than regression. – Roberto Congiu Oct 26 '17 at 16:54
  • I'm going to read your blog post now in detail! I can't possibly thank you enough for writing this; I sincerely want you to know how much you helped me. I am going to share your blog post on LinkedIn and other social networks, if that's fine. Really, thank you, thank you, thank you!! – Kevin Frankola Oct 26 '17 at 22:40
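To make the rolling scheme from the comments concrete — one regression per row, fitted over the 15 rows above and 15 below within the same file — here is a minimal pure-NumPy sketch (outside Spark; function and variable names are illustrative) of what each window computation has to produce:

```python
import numpy as np

def rolling_intercepts(X, y, half_window=15):
    """For each row i of one 'file', fit OLS over rows [i-15, i+15]
    (clipped at the file's edges) and return that window's intercept."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        A = np.column_stack([np.ones(hi - lo), X[lo:hi]])  # intercept + predictors
        beta, *_ = np.linalg.lstsq(A, y[lo:hi], rcond=None)
        out[i] = beta[0]
    return out

# One 'file' of exactly linear data, y = 1 + 2*x: every window recovers intercept 1
x = np.arange(50, dtype=float)
ints = rolling_intercepts(x.reshape(-1, 1), 1.0 + 2.0 * x)
```

This per-window fit is exactly the local computation a custom window function (like the UDWF in the linked blog post) would need to run for each frame, with Spark supplying the partition-by-filename and ±15-row framing.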

1 Answer


After doing a good bit of research I was planning on creating a window object, writing a UDF that specified how the linear regression should occur (using Spark ML's linear regression inside the function)

This cannot work, regardless of whether PySpark supports UDAFs. You cannot invoke distributed algorithms such as Spark ML's LinearRegression from inside a UDF or UDAF: those functions run on the executors, which cannot launch new distributed jobs of their own.

The question is a bit vague, and it is not clear how much data you have, but I'd consider using plain RDDs with scikit-learn (or a similar local library), or implementing the whole thing from scratch.
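One way to read this suggestion: key each row by filename, group, and run an ordinary local fit (scikit-learn, NumPy, etc.) over each file's rows — the fit is then plain single-machine code executing independently on each worker, not a distributed algorithm. A stand-in sketch that runs without a cluster, where `itertools.groupby` plays the role of the RDD grouping and all names are illustrative:

```python
from itertools import groupby
import numpy as np

def fit_file(rows):
    """Local (non-distributed) OLS over one file's rows.
    rows: iterable of (time, x, y) tuples; returns the file's intercept."""
    rows = sorted(rows)                      # order by time, as the window would
    _, xs, ys = zip(*rows)
    A = np.column_stack([np.ones(len(xs)), xs])
    beta, *_ = np.linalg.lstsq(A, np.array(ys), rcond=None)
    return beta[0]

# In Spark this per-group fit would be applied with something like:
#   rdd.map(lambda r: (r.filename, (r.time, r.x, r.y))) \
#      .groupByKey().mapValues(fit_file)
data = [("a", (t, float(t), 1.0 + 2.0 * t)) for t in range(10)] + \
       [("b", (t, float(t), 5.0 - 1.0 * t)) for t in range(10)]
results = {k: fit_file(v for _, v in grp)
           for k, grp in groupby(sorted(data), key=lambda kv: kv[0])}
```

The trade-off versus a window function is that each file's rows must fit in one executor's memory, and the rolling ±15-row logic has to be written inside the per-file function rather than expressed as a Spark window frame.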

  • "You are not allowed to use distributed algorithms from UDF / UDAF" – so what you mean here is that I can't use the built-in regression library? Is that what you're referring to as the distributed algorithms? I considered doing it in RDDs, but I'd probably have to go back and forth with my data quite a few times; if that's an option, though, I'll think about it. If what you're saying is that I can't use the built-in regression, I could definitely create my own mini function to handle it (at least in Python), so it's good to know that's an option. Also, thanks for giving me some concrete clarity. – Kevin Frankola Oct 17 '17 at 21:21
  • Yes, this is what I mean. – user8792510 Oct 17 '17 at 21:22
  • Okay, so I write the regression code myself, and otherwise what I'm suggesting makes sense and is possible? Cool, I'll mark my question solved then. Thanks!! – Kevin Frankola Oct 17 '17 at 21:59