
Thanks in advance for any help on this. I am working on a project to do system log anomaly detection on some very large data sets (we aggregate ~100 GB per day of syslogs). The method we have chosen requires singular value decomposition (SVD) on a matrix of identifiers for each log message. As we progressed we found that Spark 2.2 provides a computeSVD function (we are using the Python API - we are aware this is available in Scala and Java, but our target is Python), but we are running Spark 2.1.1 (Hortonworks HDP 2.6.2 distribution). I asked about upgrading our 2.1.1 version in place, but Spark 2.2 has not been tested against HDP yet.

We toyed with the idea of using NumPy straight from Python for this, but we are afraid we'll lose the distributed nature of Spark and possibly overload worker nodes by going outside of the Spark API. Are there any alternatives for SVD in the Spark 2.1.1 Python API? Any suggestions or pointers would be greatly appreciated. Thanks!

Another thought I forgot about in the initial posting - is there a way we can write our machine learning code primarily in the Python API, but call the Scala function we need, get that result back, and continue in Python? I don't know if that is a thing or not...
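To illustrate what I mean (the helper class name below is made up - we have not written anything yet), this is roughly the pattern I'm imagining:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scala-from-python-sketch").getOrCreate()
sc = spark.sparkContext

# Every SparkContext carries a Py4J gateway; sc._jvm exposes the driver JVM,
# so any class shipped to the cluster (e.g. in a jar passed via --jars) can be
# reached by its fully qualified name. "com.example.SvdHelper" is a placeholder.
scala_helper = sc._jvm.com.example.SvdHelper
# result = scala_helper.computeSomething(...)  # comes back as a Py4J JavaObject
```

The part I'm unsure about is passing Spark objects (DataFrames, RDDs, matrices) across that boundary and getting something usable back on the Python side.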

azdatasci
  • Please do not amend the question in the comments - edit the original post instead – desertnaut Oct 27 '17 at 09:38
  • See the answer by @eliasah here to compute SVD in PySpark < 2.2: https://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu – desertnaut Oct 27 '17 at 09:41
  • Looks promising, but I am struggling with a couple of things. Since that post referenced Spark 1.5 and we are using 2.1.1, there is a bit of an issue I'm trying to work around. We'd like to use pyspark.ml since it is newer than pyspark.mllib and you can just declare a DataFrame of vectors, but even though most of the functionality lives in pyspark.ml, RowMatrix actually belongs to pyspark.mllib and needs to be fed an RDD of mllib vectors - and that RDD of vectors then does not play nice with the PCA function being called (which wants pyspark.ml vectors). – azdatasci Oct 27 '17 at 17:24
  • Part II: So it's like I can't use the pyspark.ml vectors since they don't load into RowMatrices, but I can't use pyspark.mllib vectors since they don't work with the PCA functions being called. Any ideas? Maybe we are missing something simple that I cannot see. – azdatasci Oct 27 '17 at 17:26
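One way around the ml/mllib mismatch described in the comments above is to convert the ml vector column into mllib vectors before building the RowMatrix - a minimal sketch (Spark 2.0+, with a toy DataFrame standing in for the real feature matrix):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors                     # "new" pyspark.ml vectors
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("ml-to-mllib-vectors").getOrCreate()

# Toy DataFrame with a pyspark.ml vector column, standing in for the real
# feature matrix built from the log-message identifiers.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0, 3.0]),),
     (Vectors.dense([0.0, 2.0, 1.0]),)],
    ["features"],
)

# MLUtils.convertVectorColumnsFromML (available since Spark 2.0) rewrites the
# ml vectors as old-style mllib vectors, which RowMatrix will accept.
mllib_df = MLUtils.convertVectorColumnsFromML(df, "features")
mat = RowMatrix(mllib_df.rdd.map(lambda row: row.features))
print(mat.numRows(), mat.numCols())
```

From there the PCA/SVD side can stay entirely in pyspark.mllib, so the two vector types never have to mix.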

1 Answer


To bring this to a close, we ended up writing our own SVD function based on the example at:

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?

There were some minor tweaks and I will post them as soon as we have them finalized, but overall it was the same. That answer was written for Spark 1.5 and we are using Spark 2.1.1. As noted, Spark 2.2 contains a computeSVD() function - unfortunately, at the time of this posting, the HDP distribution we are using did not support 2.2. Yesterday (11/1/2017), HDP 2.6.3 was announced with support for Spark 2.2. Once we upgrade, we'll convert the code to take advantage of the built-in computeSVD() function that Spark 2.2 provides. Thanks for all the help and the pointers to the link above - they helped greatly!
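Until we post the final version, the heart of the linked approach (essentially eliasah's code, condensed here and untweaked) is a thin Python wrapper that reaches the Scala computeSVD through the RowMatrix's Py4J handle:

```python
from pyspark.mllib.common import JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix


class SVD(JavaModelWrapper):
    """Thin wrapper around the Scala SVD result object."""

    @property
    def U(self):
        """Left singular vectors as a RowMatrix (only present if computeU=True)."""
        u = self.call("U")
        if u is not None:
            return RowMatrix(u)

    @property
    def s(self):
        """Singular values as a DenseVector, in descending order."""
        return self.call("s")

    @property
    def V(self):
        """Right singular vectors as a DenseMatrix."""
        return self.call("V")


def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """Call the JVM-side RowMatrix.computeSVD through Py4J and wrap the result."""
    java_model = row_matrix._java_matrix_wrapper.call(
        "computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)


# Usage: mat is a pyspark.mllib RowMatrix built from an RDD of mllib vectors.
# svd = computeSVD(mat, k=10, computeU=True)
# svd.s, svd.V, svd.U
```

Once we are on Spark 2.2, the wrapper goes away and the same thing becomes a direct call on the RowMatrix: svd = mat.computeSVD(k, computeU=True).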

azdatasci