Thanks in advance for any help on this. I am working on a project to do some system log anomaly detection on some very large data sets (we aggregate ~100gb per day of syslogs). The method/road we have chosen requires the need of singular decomposition value on a matrix of identifiers for each log message. As we progressed we found that Spark 2.2 provides a computeSVD function (we are using Python API - we are aware that this is available in Scala and Java, but our target is to use Python), but we are running Spark 2.1.1 (HortonWorks HDP 2.6.2 distribution). I asked about upgrading our 2.1.1 version in place, but the 2.2 version has not been tested against HDP yet.
We toyed with the idea of using Numpy straight from Python for this, but we are afraid we'll break the disinterestedness of Spark and possibly overload worker nodes by going outside of the Spark API. Are there any alternatives in the Spark 2.1.1 Python API for SVD? Any suggestion or pointers would greatly be appreciated. Thanks!
Another though I forgot about in the initial posting - is there a way we can write our machine learning primarily in the Python API, but maybe call that Scala function we need, return that result and continue with Python? I don't know if that is a thing or not....