I created a set of algorithms and helpers in Scala for Spark that work with different formats of measured data. They are all based on Hadoop's FileInputFormat. I also created some helpers to ease working with time series data from a Cassandra database. I now need some advanced functions which are already present in Thunder, and some of my colleagues who will be working with these helper functions want to use Python. Is it somehow possible to use these helper functions from Python, or do I have to reimplement them?

I read through a lot of docs and only found that you can load extra JARs with PySpark, but not how to call the functions in them.
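
For example, shipping the JAR itself seems straightforward; the path and app name below are just placeholders, and this should be equivalent to launching the shell with pyspark --jars /path/to/my-helpers.jar:

import os
from pyspark import SparkContext

# Make the helper JAR visible to the JVM that PySpark launches.
# PYSPARK_SUBMIT_ARGS has to be set before the SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/my-helpers.jar pyspark-shell"

sc = SparkContext(appName="measurements")

What I cannot figure out is how to actually call the Scala helpers from that JAR afterwards.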

rabejens
  • It is actually possible. – eliasah Feb 24 '16 at 13:12
  • @eliasah It depends, doesn't it? You can trigger high-level transformations, but it is not possible to do the same thing from the workers. – zero323 Feb 24 '16 at 13:19
  • That's true! I was thinking of the other way around, like what I did [here](http://stackoverflow.com/a/33500704/3415409) – eliasah Feb 24 '16 at 13:26
  • So, if I created the "sc.coolMeasuringDataFile" helper via an implicit class, can I use that from PySpark, and if yes, how do I do that? – rabejens Feb 24 '16 at 14:05

1 Answer

"By accident" I found the solution: It is the "Java Gateway". This is not documented in the Spark documentation (at least I didn't find it).

Here is how it works, using a GregorianCalendar as an example:

j = sc._gateway.jvm   # Py4J gateway into the driver's JVM
cal = j.java.util.GregorianCalendar()
print(cal.getTimeInMillis())

However, passing the SparkContext does not work directly. For example, this fails:

ref = j.java.util.concurrent.atomic.AtomicReference()
ref.set(sc)

However, the Java SparkContext is stored in the _jsc field, and passing that instead:

ref = j.java.util.concurrent.atomic.AtomicReference()
ref.set(sc._jsc)

works.

However, note that sc._jsc is a Java-based Spark context, i.e., a JavaSparkContext. To get the underlying Scala SparkContext, you have to use:

sc._jsc.sc()
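
Putting it together, a Scala helper from my own JAR can then, in principle, be called through the gateway. The package, object and method names below are made up for illustration and stand in for whatever is actually in the JAR:

j = sc._gateway.jvm
scala_sc = sc._jsc.sc()  # the Scala SparkContext, unwrapped from the JavaSparkContext

# Hypothetical Scala helper: object com.example.MeasureHelpers with a method
# loadMeasurements(sc: SparkContext, path: String). A top-level Scala object gets
# static forwarders, so Py4J can call the method like a static Java method:
jvm_rdd = j.com.example.MeasureHelpers.loadMeasurements(scala_sc, "hdfs:///data/measurements")

Note that the result is only a Py4J handle to the JVM-side RDD; using it as a regular Python RDD needs additional wrapping on the Python side.
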
rabejens
  • Good one! Nevertheless, it isn't documented in Spark because it's not Spark-related but rather Java/Python interoperability related. – eliasah Feb 24 '16 at 16:29