14

I'm looking to convert some R code (functions such as lmtest::coeftest() and sandwich::sandwich()) to sparklyr. I'm trying to get started with sparklyr extensions, but I'm pretty new to the Spark API and running into issues :(

I'm running Spark 2.1.1 and sparklyr 0.5.5-9002.

I feel the first step is to create a DenseMatrix object using the linalg library:

library(sparklyr)
library(dplyr)
sc <- spark_connect("local")

rows <- as.integer(2)
cols <- as.integer(2)
array <- c(1,2,3,4)

mat <- invoke_new(sc, "org.apache.spark.mllib.linalg.DenseMatrix", 
                  rows, cols, array)

This results in the error:

Error: java.lang.Exception: No matched constructor found for class org.apache.spark.mllib.linalg.DenseMatrix

Okay, so I got a java.lang exception. I'm pretty sure the rows and cols arguments were fine in the constructor, but I'm not so sure about the last one, which is supposed to be a Java Array. So I tried a few permutations of:

array <- invoke_new(sc, "java.util.Arrays", c(1,2,3,4))

but I end up with a similar error message:

Error: java.lang.Exception: No matched constructor found for class java.util.Arrays

I feel like I'm missing something pretty basic. Anyone know what's up?

Zafar

1 Answer

13

The R counterpart of a Java Array is a list:

invoke_new(
  sc, "org.apache.spark.ml.linalg.DenseMatrix",
  2L, 2L, list(1, 2, 3, 4))

## <jobj[17]>
##   class org.apache.spark.ml.linalg.DenseMatrix
##   1.0  3.0
##   2.0  4.0

or

invoke_static(
  sc, "org.apache.spark.ml.linalg.Matrices", "dense",
  2L, 2L, list(1, 2, 3, 4))

## <jobj[19]>
##   class org.apache.spark.ml.linalg.DenseMatrix
##   1.0  3.0
##   2.0  4.0

Please note that I am using o.a.s.ml.linalg instead of o.a.s.mllib.linalg. While mllib would work in isolation, as of Spark 2.x the o.a.s.ml algorithms no longer accept local o.a.s.mllib types.

At the same time, R vector types (numeric, integer, character) are treated as scalars, not arrays.
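A minimal sketch of the scalar-vs-array distinction (assumes a live connection `sc`; the class names are standard Spark classes, not part of the question):

```r
library(sparklyr)
sc <- spark_connect(master = "local")

# A length-one vector is serialized as a plain scalar:
invoke_new(sc, "java.lang.Double", 1.0)

# A list is serialized as a Java array, so constructors taking
# double[] / Array[Double] can be matched:
invoke_new(sc, "org.apache.spark.ml.linalg.DenseVector", list(1, 2, 3, 4))

# An unwrapped multi-element vector like c(1, 2, 3, 4) is not mapped to a
# double[] parameter, which is why the original DenseMatrix call failed to
# find a matching constructor.
```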

Note:

Personally, I believe this is not the way to go. Spark's linalg packages are quite limited and internally depend on libraries that won't be usable via sparklyr. Moreover, the sparklyr invoke API is not well suited for complex logic.

In practice it makes more sense to implement a Java or Scala extension with a thin, R-friendly wrapper.
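A hedged sketch of that pattern on the R side, assuming a hypothetical Scala object `com.example.LinAlg` with a `rank` method has been compiled into `linalg-ext.jar` (all names here are illustrative, not a real library):

```r
# In the R package, register the jar via sparklyr's extension mechanism;
# sparklyr calls spark_dependencies() when a connection is created.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    jars = "inst/java/linalg-ext.jar"   # hypothetical extension jar
  )
}

# Thin, R-friendly wrapper: all heavy lifting happens in Scala,
# R only forwards the call.
matrix_rank <- function(sc, jmat) {
  sparklyr::invoke_static(sc, "com.example.LinAlg", "rank", jmat)
}
```

This keeps the R surface small while the linear-algebra logic lives in JVM code where proper libraries are available.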

zero323
  • regarding your note, do you know of any resources for making these extensions? Also any guides showing how to call custom extensions from R? – Zafar Jun 25 '17 at 16:41
  • Sorry, I am not aware of any. There is of course [the official sparklyr guide](http://spark.rstudio.com/extensions.html), but I don't think it is that useful. Overall I think this is more about API design. The SparkR API is a nice example, with heavy logic implemented in Scala and thin, R-friendly adapters. – zero323 Jun 25 '17 at 17:07
  • I very much appreciate your comments. Looks like we'll be doing some Scala programming. I know that we'll need the `rank` linear algebra method and `linalg` does not have it. – Zafar Jun 26 '17 at 16:37
  • Yeah, if you need proper linear algebra libraries I would go directly to [Breeze](https://github.com/scalanlp/breeze/wiki/Quickstart). The design should be familiar enough if you've worked with NumPy or Matlab, and it is already a dependency of Spark. – zero323 Jun 26 '17 at 18:55
  • Update many years later. We ended up going the suggested route and made a Spark package written in Scala. Then we just made an R package that calls that Spark library. – Zafar Mar 29 '22 at 19:51