
I am dealing with Spark DataFrames with very long columns that represent time-domain signals: many millions of rows.

I need to perform signal processing on these using some functions from SciPy, which require the columns to be passed in as NumPy arrays. Specifically, I am trying to use these two:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.stft.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html

My current approach is to turn the column into a NumPy array using collect, as follows:

x = np.array(df.select("column_name").collect()).reshape(-1)

Then I feed the result to SciPy
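For concreteness, here is the full shape of that approach as a minimal sketch; the column name, sampling rate, and filter parameters are placeholders, not values from my actual pipeline:

import numpy as np
from scipy.signal import stft, savgol_filter

# Collect the column to the driver and flatten the list of Rows
# into a 1-D NumPy array (this is the slow, non-scalable step).
x = np.array(df.select("column_name").collect()).reshape(-1)

# Feed the array to SciPy; fs, nperseg, window_length and polyorder
# are placeholder values, not tuned for any real signal.
f, t, Zxx = stft(x, fs=1000.0, nperseg=256)
smoothed = savgol_filter(x, window_length=51, polyorder=3)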

However:

  1. This is very slow
  2. collect() loads all the data into the driver and therefore does not scale

Please, could somebody help me find the most performant way to do this?

I found this somewhat old post, but it does not reach a conclusion:

How to convert a pyspark dataframe column to numpy array

Objections to that post:

  1. It is not clear to me that Dask arrays are compatible with SciPy, because Dask apparently implements only a subset of the NumPy API.
  2. Dask would not solve the problem of converting from a PySpark DataFrame to an array in the first place (which is what I currently use collect() for).

Please, help on this is much needed and would be greatly appreciated.

Many thanks in advance.

jj_coder
  • I know it is a bit too broad, but have you considered using https://spark.apache.org/mllib/ ? It is a machine learning library designed specifically to work on top of Spark, and it takes Spark's specifics (distributed computing, ...) into account – Coding thermodynamist Feb 15 '23 at 16:09
  • Thank you for your answer. I have, but many of the SciPy functions are simply not available. – jj_coder Feb 16 '23 at 08:58
  • Is there a way to group your data into smaller chunks so that you can apply the SciPy function to each partition? For example, do you have independent time series? Then we could apply the function to each partition in a distributed way. – Kevin Kho Feb 17 '23 at 03:46
  • Thank you all for your interest. In the end I did not find a general approach; I am still working on it. If somebody finds this, an answer would be appreciated :) – jj_coder Mar 11 '23 at 15:17
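For reference, the per-group approach suggested in the comments would look roughly like the sketch below. It assumes each independent series carries a key column (series_id here is hypothetical, as is the timestamp ordering column) and that every single series fits in one executor's memory:

import pandas as pd
from scipy.signal import savgol_filter

def smooth(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives on an executor as a plain pandas DataFrame,
    # so its columns can be handed to SciPy as NumPy arrays directly.
    pdf = pdf.sort_values("timestamp")  # hypothetical ordering column
    # window_length must be odd and no larger than the group size.
    pdf["smoothed"] = savgol_filter(pdf["column_name"].to_numpy(),
                                    window_length=51, polyorder=3)
    return pdf

result = (
    df.groupBy("series_id")  # hypothetical per-series key
      .applyInPandas(
          smooth,
          schema="series_id long, timestamp long, column_name double, smoothed double",
      )
)

This sidesteps collect() entirely, but it only helps when the data splits into independent series; a single multi-million-row signal that must be processed as one array still has to land in a single process's memory.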

0 Answers